


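To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 130640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}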
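Every entry on this page uses the same naming pattern, so the load command can be parameterized by language code. The following is a minimal sketch, not part of the catalog: the helper names `oscar_config_name` and `load_oscar` are hypothetical, and it assumes `tensorflow_datasets` is installed and that the `huggingface:` namespace resolves in your environment (which may require additional packages).

```python
def oscar_config_name(lang: str) -> str:
    """Build the TFDS name for a deduplicated, unshuffled OSCAR config."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one split of the given language config (hypothetical helper)."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang), split=split)


print(oscar_config_name("af"))
```

Calling `load_oscar("af")` would then be equivalent to the command shown for the `af` entry, restricted to the 'train' split.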


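To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4518
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}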
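The feature specification above describes each record as an `int64` field named `id` and a string field named `text`. As a rough illustration, the sketch below checks a plain Python dict against that shape. The function name `matches_features` is hypothetical, and it assumes records have already been decoded into ordinary Python values rather than tensors.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record: dict) -> bool:
    """Return True if record has exactly an int64 'id' and a string 'text'."""
    if set(record) != {"id", "text"}:
        return False
    rid, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_int64 = (
        isinstance(rid, int)
        and not isinstance(rid, bool)
        and INT64_MIN <= rid <= INT64_MAX
    )
    return is_int64 and isinstance(text, str)


print(matches_features({"id": 0, "text": "example"}))
```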


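To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 79928
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}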


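To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2025
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}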


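To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}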


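To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 27050
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}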


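To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 43102
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}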


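To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}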


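To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}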


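To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 307405
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}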


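To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 15762
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}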


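To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}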


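To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 26145
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}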


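To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 626796
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}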


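To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}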
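The 'train' split sizes listed so far range from 1 (`bcl`) to 626796 (`az`), so it can be useful to filter configurations by size before loading anything. The sketch below is illustrative only: it assumes the number after 'train' is the example count, copies the values from the entries above, and uses a hypothetical helper name, `configs_with_at_least`.

```python
# 'train' split sizes copied from the entries above (assumed to be example counts).
TRAIN_SIZES = {
    "af": 130640, "als": 4518, "arz": 79928, "an": 2025, "ast": 5343,
    "ba": 27050, "am": 43102, "as": 9212, "azb": 9985, "be": 307405,
    "bo": 15762, "bxr": 36, "ceb": 26145, "az": 626796, "bcl": 1,
}


def configs_with_at_least(min_examples: int, sizes: dict = TRAIN_SIZES) -> list:
    """Return language codes whose 'train' split has at least min_examples."""
    return sorted(lang for lang, n in sizes.items() if n >= min_examples)


print(configs_with_at_least(100_000))
```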


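To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 98225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}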


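To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 37
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}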


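To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1114481
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}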


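To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 702
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}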


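To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2984
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}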


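To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 10130
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}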


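To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}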


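To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 80
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}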


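To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1172041
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}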


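To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3398679
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}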


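To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1770
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}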


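To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 2458067
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}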


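To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 68210
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}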


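To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9006977
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}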


TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • 설명 :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
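  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: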

'train' 360
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
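  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: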

'train' 4
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
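  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: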

'train' 82
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
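  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: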

'train' 14724
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
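  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: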

'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
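The 'train' counts listed so far range from a single example (cbk) to about nine million (ar), so it may be worth checking a config's size before choosing it. The sketch below uses a handful of counts copied from this page; the `TRAIN_EXAMPLES` table and the `configs_below` helper are hypothetical names introduced only for illustration.

```python
# Example counts copied from the 'train' rows listed on this page.
TRAIN_EXAMPLES = {
    "ar": 9006977,
    "av": 360,
    "bar": 4,
    "bh": 82,
    "br": 14724,
    "cbk": 1,
}


def configs_below(sizes, min_examples):
    """Return config codes with fewer than `min_examples`, smallest first."""
    small = [(count, code) for code, count in sizes.items() if count < min_examples]
    return [code for count, code in sorted(small)]


if __name__ == "__main__":
    # Configs that are probably too small for anything beyond inspection.
    print(configs_below(TRAIN_EXAMPLES, 100))
```

The threshold of 100 is arbitrary; a suitable cutoff depends on the intended use.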


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
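  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: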

'train' 4771098
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
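  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: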

'train' 17024
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
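  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: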

'train' 84752
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
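  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: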

'train' 8203495
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
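  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: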

'train' 20661
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
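  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: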

'train' 68
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
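  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: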

'train' 12308039
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
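  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: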

'train' 1909387
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
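  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: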

'train' 6582908
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
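  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: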

'train' 11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
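  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: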

'train' 59448891
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
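  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: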

'train' 3883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
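  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: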

'train' 169834
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
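  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: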

'train' 3084
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
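  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: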

'train' 529
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
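  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: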

'train' 617
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
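  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: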

'train' 617
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
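  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: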

'train' 108346
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
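  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: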

'train' 29054
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
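  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: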

'train' 18808
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
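  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: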

'train' 1374
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
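  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: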

'train' 843195
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
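  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: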

'train' 166
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
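  • License: These data are released under this licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate it.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits: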

'train' 212556
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • 라이센스 :이 데이터는이 라이센스 체계에 따라 해제됩니다. 우리는 이러한 데이터가 추출 된 텍스트를 소유하지 않습니다. 우리는 Creative Commons CC0 라이센스 ( "권한 예약 없음")에 따라 이러한 데이터의 실제 포장을 라이센스합니다. http://creativecommons.org/publicdomain/zero/1.0/ 은 법률에 따라 가능한 한 모든 저작권 및 관련 OR을 면제했습니다. 오스카에 대한 이웃 권리이 작품은 프랑스에서 출판되었습니다.

    당사의 데이터에 귀하가 소유 한 자료가 포함되어 있으므로 여기에서 재현되지 않아야한다고 생각하면 다음과 같습니다.

    • 연락 할 수있는 주소, 전화 번호 또는 이메일 주소와 같은 자세한 연락처 데이터를 사용하여 자신을 명확하게 식별하십시오.
    • 침해 된 것으로 주장 된 저작권이있는 저작물을 명확하게 식별하십시오.
    • 침해 중이라고 주장 된 자료와 정보를 우리가 자료를 찾을 수 있도록 합리적으로 충분한 정보를 식별하십시오.

    우리는 코퍼스의 다음 릴리스에서 영향을받는 출처를 제거하여 합법적 인 요청을 준수 할 것입니다.

  • Version : 1.0.0

  • Splits :

'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 58
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2126
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 6485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

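Split sizes on this page vary from a single example to hundreds of thousands and more, so it can help to cap a request at the listed size. The sketch below assumes the `split[:n]` slicing syntax from the TFDS split API is also honored for `huggingface:` datasets, which is not verified here. `capped_split` is a hypothetical helper, and the sizes are copied from the listings above.

```python
# 'train' sizes copied from the entries above on this page.
TRAIN_SIZES = {"mr": 212556, "mwl": 7, "nah": 58, "oc": 6485, "pam": 1}


def capped_split(lang_code: str, wanted: int, split: str = "train") -> str:
    """Return a split string asking for at most `wanted` examples."""
    size = TRAIN_SIZES[lang_code]
    n = max(0, min(wanted, size))  # clamp to the range [0, size]
    return f"{split}[:{n}]"
```

The resulting string would be passed as the `split` argument of `tfds.load`.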

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 67921
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 28522082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 372158
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 5044757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 17
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3675420
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 68
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1381
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 72
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 13343
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 453904
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 183443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 5
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 8714
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 109118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2559
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 2859 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
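
As a rough illustration of how the configuration names and the feature schema above fit together, the sketch below builds a TFDS name for an OSCAR configuration and checks that a record has an int64 "id" and a string "text". The helpers `oscar_tfds_name` and `matches_features` are hypothetical names introduced here, not part of TFDS or the OSCAR release, and the naming pattern is inferred only from the load commands listed on this page.

```python
# Hypothetical helpers for illustration; not part of TFDS or OSCAR.

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Build a name of the form used in the load commands on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def matches_features(record: dict) -> bool:
    """Check a record against the listed features: int64 "id", string "text"."""
    if set(record) != {"id", "text"}:
        return False
    record_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(record_id, bool) or not isinstance(record_id, int):
        return False
    return INT64_MIN <= record_id <= INT64_MAX and isinstance(text, str)


print(oscar_tfds_name("pms"))
print(matches_features({"id": 0, "text": "example"}))
```

The string returned by `oscar_tfds_name` could be passed to `tfds.load` as in the command above. Records read through TFDS may carry "text" as bytes rather than `str`, in which case the check would need adjusting.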


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 411 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 7121 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 2820821 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 17610 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 42 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 645747 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 833101 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 4694 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 24 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 15074 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 677 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 2418 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 11014487 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 56259 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 62398034 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 11596446 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 6521169 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 7782375 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 9897709 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    'train': 64 examples
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 49
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 7324
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 158113
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 912330
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
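
The configuration names on this page appear to follow one pattern: `unshuffled_deduplicated_<lang>` or `unshuffled_original_<lang>`, prefixed with `huggingface:oscar/`. The sketch below assembles such names. `oscar_config_name` is a hypothetical helper, and it does not check that a given language code actually exists in the catalog.

```python
def oscar_config_name(lang, deduplicated=True):
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_deduplicated_es'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


print(oscar_config_name("es"))
print(oscar_config_name("da", deduplicated=False))
```

The returned string is intended to be passed as the first argument of `tfds.load`.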


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
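
The 'train' split sizes listed so far range from a few hundred examples to tens of millions. The sketch below picks out the smaller configurations. The counts are copied from the entries above; the `SMALL_LIMIT` cutoff and the helper `small_configs` are hypothetical choices for illustration only.

```python
# 'train' split sizes copied from the entries above.
TRAIN_SIZES = {
    "unshuffled_deduplicated_es": 56326016,
    "unshuffled_original_da": 7664010,
    "unshuffled_deduplicated_fi": 5326443,
    "unshuffled_deduplicated_hy": 396093,
    "unshuffled_deduplicated_hr": 321484,
    "unshuffled_deduplicated_gom": 484,
    "unshuffled_deduplicated_ilo": 1578,
}

# Hypothetical cutoff: configs below it are treated as small enough to explore quickly.
SMALL_LIMIT = 1_000_000


def small_configs(sizes, limit=SMALL_LIMIT):
    """Return config names with fewer than `limit` examples, smallest first."""
    return sorted((name for name, count in sizes.items() if count < limit), key=sizes.get)


for name in small_configs(TRAIN_SIZES):
    print(name, TRAIN_SIZES[name])
```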


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 13704702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 33053
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 106
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3264660
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 11197780
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 39496439
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
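
With a 'train' split of this size, it may be more practical to look at a handful of examples than to iterate over everything. The sketch below shows one way to do that. `peek` and `load_train` are hypothetical helpers; `load_train` assumes `tensorflow_datasets` is installed and that the configuration can be downloaded and prepared, which the demonstration here avoids by using stand-in records.

```python
from itertools import islice


def peek(examples, n=3):
    """Return the first `n` items of any iterable, stopping once they are read."""
    return list(islice(examples, n))


def load_train(name):
    """Load the 'train' split as an iterable of NumPy records (needs tensorflow_datasets)."""
    import tensorflow_datasets as tfds  # imported lazily so `peek` works without it

    return tfds.as_numpy(tfds.load(name, split="train"))


# Stand-in records with the listed fields; real use would look like
# peek(load_train("huggingface:oscar/unshuffled_deduplicated_ja")).
fake = ({"id": i, "text": f"doc {i}"} for i in range(10))
print(peek(fake, 2))
```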


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 338073
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1377
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 86561
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1737411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 197878
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 16383
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 917
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 219334
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3229940
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 87235
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 8555
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 120684
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 461598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 24803
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3749826
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 82738
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 428674
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3317
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 36
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
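
The names on this page follow a visible pattern: the prefix `huggingface:oscar/`, then `unshuffled_deduplicated_` or `unshuffled_original_`, then a language code. A minimal sketch of building and parsing such names follows. The helpers `oscar_tfds_name` and `parse_oscar_tfds_name` are hypothetical and not part of TFDS, and the pattern is inferred only from the names listed in this catalog.

```python
from typing import NamedTuple

PREFIX = "huggingface:oscar/"
VARIANTS = ("deduplicated", "original")


class OscarConfig(NamedTuple):
    variant: str   # "deduplicated" or "original"
    language: str  # language code as written on this page, e.g. "am"


def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Build a name such as 'huggingface:oscar/unshuffled_original_am'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"{PREFIX}unshuffled_{variant}_{language}"


def parse_oscar_tfds_name(name: str) -> OscarConfig:
    """Split a name of the assumed pattern into its variant and language code."""
    if not name.startswith(PREFIX):
        raise ValueError(f"not an OSCAR TFDS name: {name!r}")
    # maxsplit=2 keeps any underscore inside the language code intact.
    parts = name[len(PREFIX):].split("_", 2)
    if len(parts) != 3 or parts[0] != "unshuffled" or parts[1] not in VARIANTS:
        raise ValueError(f"unexpected config name: {name!r}")
    return OscarConfig(variant=parts[1], language=parts[2])


print(oscar_tfds_name("am", deduplicated=False))
print(parse_oscar_tfds_name("huggingface:oscar/unshuffled_deduplicated_yue"))
```

The string returned by `oscar_tfds_name` can be passed to `tfds.load` in the same way as the literal name in the command above.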
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 83663
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 14985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

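The load commands in this listing differ only in the configuration suffix, such as `unshuffled_original_as`. The sketch below builds that name from a language code. It assumes every configuration follows the `unshuffled_<variant>_<lang>` pattern seen on this page; `oscar_tfds_name` and `load_oscar` are hypothetical helpers, and only `tfds.load` is an actual TFDS call. Whether a given language code exists in both variants is not checked.

```python
# Hypothetical helpers for illustration; only tfds.load is a real TFDS call.

def oscar_tfds_name(lang, deduplicated=False):
    """Build a name such as 'huggingface:oscar/unshuffled_original_as'.

    Assumes the naming pattern seen in this listing holds for `lang`.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang, deduplicated=False, split="train"):
    """Load one OSCAR configuration through TFDS (needs tensorflow_datasets)."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)
```

For example, `oscar_tfds_name("as")` yields the name used in the command above, and `load_oscar("as")` would pass it to `tfds.load` with the `train` split.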

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 15446
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 586031
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 26795
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 42
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 56248
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 157698
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 65
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 96742378
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5799
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 240691
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7959
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1040
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 694
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 832
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 159363
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 46535
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 94588
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1401
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1593820
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 220
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

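The load commands on this page all follow one naming pattern, `huggingface:oscar/unshuffled_<variant>_<language code>`. As a minimal sketch, the helper below builds that name; `oscar_tfds_name` and `load_oscar_train` are hypothetical names introduced for illustration, and the loader assumes `tensorflow_datasets` is installed (the `huggingface:` namespace may additionally require the Hugging Face `datasets` package).

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR config, following the pattern used on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar_train(lang: str, deduplicated: bool = False):
    """Load the 'train' split of one OSCAR config (hypothetical convenience wrapper)."""
    # Imported lazily so that building a name does not require TensorFlow Datasets.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split="train")


print(oscar_tfds_name("min"))  # huggingface:oscar/unshuffled_original_min
```

Calling `load_oscar_train("min")` would then be equivalent to the `tfds.load` command shown for this config, restricted to its only split, 'train'.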

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 326804
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

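Every config on this page lists the same two features, an `int64` field `id` and a `string` field `text`. The sketch below checks a record against that schema using only the standard library. It assumes the record has already been converted to plain Python values (records yielded by TFDS are typically tensors or bytes and would need decoding first); the dtype-to-type mapping, the sample records, and the name `matches_features` are illustrative assumptions.

```python
import json

# The feature schema as listed on this page, written compactly.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype strings above to plain Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))    # True
print(matches_features({"id": "0", "text": "example"}, features))  # False: id is a string
```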

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 8
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 61
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4696
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 10709
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 98216
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9387265
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 21
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5492194
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1013619
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1263280
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6456
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 34
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 27537
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1001
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3783
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 46981781
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 563916
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7345075
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 203 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
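The config names in this catalog follow a visible pattern: `unshuffled_original_<lang>` for the entries here and `unshuffled_deduplicated_<lang>` for the deduplicated variants. A minimal Python sketch of building the name passed to `tfds.load` might look like the following. The helper names `oscar_config_name` and `load_oscar` are hypothetical, and the loader assumes that `tensorflow-datasets` is installed and that network access is available.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for one OSCAR config from its language code."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one config (hypothetical helper, not part of TFDS).

    The import is deferred so that oscar_config_name works even when
    tensorflow-datasets is not installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang, deduplicated), split=split)


print(oscar_config_name("kw"))
```

Calling `load_oscar("kw")` would then be equivalent to the `tfds.load` command shown for this entry, restricted to the 'train' split.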


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 1485 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 88 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 17957 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 603937 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 534016 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 6 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
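Every config in this catalog declares the same two features: an `int64` `id` and a `string` `text`. As a rough illustration, the sketch below checks a record against that schema. It assumes the record has already been converted to a plain Python dict with `text` decoded to `str`, which is not necessarily the form TFDS yields; the function name `matches_oscar_features` is hypothetical.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if record has exactly an int64-range 'id' and a str 'text'."""
    if set(record) != {"id", "text"}:
        return False
    record_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(record_id, bool) or not isinstance(record_id, int):
        return False
    return INT64_MIN <= record_id <= INT64_MAX and isinstance(text, str)


print(matches_oscar_features({"id": 0, "text": "example"}))    # True
print(matches_oscar_features({"id": "0", "text": "example"}))  # False
```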


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 18174 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 185884 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 5213 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 3225 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 452 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 14291 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 36700 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 156 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 17395625 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 89002 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 18535253 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 12973467 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 14898250 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train': 214 examples
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 60137667
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

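The dataset names on this page follow a visible pattern: `huggingface:oscar/unshuffled_` followed by `original` or `deduplicated` and a language code. The sketch below builds and parses such names. The helper names (`oscar_tfds_name`, `parse_oscar_tfds_name`, `load_oscar`) are hypothetical, and the pattern is inferred from the names listed here rather than taken from a specification. `load_oscar` imports `tensorflow_datasets` lazily, so the two string helpers run without it.

```python
def oscar_tfds_name(language, deduplicated=True):
    """Build a name such as 'huggingface:oscar/unshuffled_deduplicated_en'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def parse_oscar_tfds_name(name):
    """Split a name of the form above into (variant, language)."""
    prefix = "huggingface:oscar/unshuffled_"
    if not name.startswith(prefix):
        raise ValueError(f"unexpected dataset name: {name!r}")
    variant, _, language = name[len(prefix):].partition("_")
    if variant not in ("original", "deduplicated") or not language:
        raise ValueError(f"unexpected dataset name: {name!r}")
    return variant, language


def load_oscar(language, deduplicated=True, split="train"):
    """Load one configuration; needs tensorflow_datasets and network access."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


print(oscar_tfds_name("zh", deduplicated=False))
print(parse_oscar_tfds_name("huggingface:oscar/unshuffled_deduplicated_en"))
```

Passing `split="train"` asks for the only split listed for these configurations; large configurations such as `en` or `zh` can take considerable time and disk space to prepare.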

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 304230423
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 256513
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 389515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1163
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 251064
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 924
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 21735
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 32652
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 25
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 299457
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 669
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 136639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 55
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 20812149
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
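
The dataset names on this page follow one pattern: `huggingface:oscar/unshuffled_` followed by `deduplicated` or `original` and a language code. A minimal sketch of building such a name programmatically is shown below. The helpers `oscar_tfds_name` and `load_oscar` are hypothetical names introduced here; only `tfds.load` and its `split` argument come from TFDS, and the import is deferred so the name helper works without TensorFlow Datasets installed.

```python
def oscar_tfds_name(language, deduplicated=True):
    """Build a name such as 'huggingface:oscar/unshuffled_deduplicated_nl'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language, deduplicated=True, split="train"):
    """Load one OSCAR configuration through TFDS (hypothetical convenience wrapper)."""
    # Imported lazily so that building names does not require TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)
```

With these helpers, `oscar_tfds_name("nl")` yields the same string used in the load command for the Dutch deduplicated configuration.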


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 44230
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 20682611
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 26920397
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 115954598
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
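
With 115954598 examples in its 'train' split, this configuration is the largest on this part of the page, so reading only a slice can be convenient for quick experiments. The sketch below builds a TFDS split-slicing string of the form `train[:N]`. The helper names `head_split` and `fraction_of_split` are assumptions introduced for illustration. Slicing selects which examples are read; it does not necessarily reduce what has to be downloaded or prepared.

```python
# 'train' size listed above for unshuffled_deduplicated_ru.
RU_TRAIN_EXAMPLES = 115_954_598


def head_split(n_examples, split="train"):
    """Return a split string selecting the first `n_examples` of `split`."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def fraction_of_split(n_examples, total):
    """Return the share of the split that `n_examples` represents."""
    return n_examples / total
```

The resulting string can be passed as the `split` argument, for example `tfds.load('huggingface:oscar/unshuffled_deduplicated_ru', split=head_split(10_000))`.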


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 33925
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 886223
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 511
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 312644
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 294132
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 15503
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 64
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 9161
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 32919
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 201117
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
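
The 'train' sizes listed on this page allow a rough comparison between the original and deduplicated variants of a language. The sketch below does this for Afrikaans, using 201117 examples for unshuffled_original_af and 130640 for unshuffled_deduplicated_af, both taken from the catalog entries. The helper name `retained_fraction` is an assumption introduced for illustration.

```python
# 'train' sizes as listed in the catalog entries for Afrikaans.
ORIGINAL_AF = 201_117
DEDUPLICATED_AF = 130_640


def retained_fraction(original, deduplicated):
    """Return the share of examples that remain after deduplication."""
    if original <= 0:
        raise ValueError("original must be positive")
    return deduplicated / original


# Number of examples present in the original variant but not the deduplicated one.
removed = ORIGINAL_AF - DEDUPLICATED_AF
```

For these two entries, roughly 65% of the examples remain in the deduplicated variant.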


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 16365602
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 456
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 336
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 37085
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 21001388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

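The load commands in this catalog all follow the same naming pattern, `huggingface:oscar/<variant>_<language>`. The sketch below builds that name and wraps the `tfds.load` call. The helper names `oscar_config_name` and `load_oscar` are hypothetical, and the sketch assumes `tensorflow_datasets` (plus whatever backend the `huggingface:` namespace requires) is installed:

```python
def oscar_config_name(language, variant="unshuffled_original"):
    """Build the TFDS name used in this catalog for one OSCAR configuration."""
    return f"huggingface:oscar/{variant}_{language}"


def load_oscar(language, variant="unshuffled_original", split="train"):
    """Load one OSCAR configuration via TFDS (may download data on first use)."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(language, variant), split=split)


if __name__ == "__main__":
    # Only builds the name; calling load_oscar("cs") would trigger a download.
    print(oscar_config_name("cs"))
```

Since the `train` split for `cs` lists 21001388 examples, a first call to `load_oscar("cs")` may take a long time and a lot of disk space.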

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 104913504
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 10425596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 88199221
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 8557453
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 83223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 640
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 582219
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 659430
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2638
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 62721527
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 524591
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1581
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 146993
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 137
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 2977757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 3212
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 395605
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 26598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 1055
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 299938
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


TFDS에 이 데이터세트를 로드하려면 다음 명령어를 사용하세요.

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 5546211
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
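
Every OSCAR config listed here shares the same two features, an `int64` `id` and a string `text`. As a minimal sketch, the check below validates a record against that schema. It assumes the record has already been converted to plain Python values (an `int` and a `str`); records as returned by TFDS are tensors and would need decoding first. The function name `matches_oscar_features` is hypothetical.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if record has exactly an int64-range "id" and a string "text"."""
    if set(record) != {"id", "text"}:
        return False
    record_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_int64 = (
        isinstance(record_id, int)
        and not isinstance(record_id, bool)
        and INT64_MIN <= record_id <= INT64_MAX
    )
    return is_int64 and isinstance(text, str)


if __name__ == "__main__":
    print(matches_oscar_features({"id": 0, "text": "example"}))
```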


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 127467
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 4599
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 41
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 22301
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 203082
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 672077
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 41986
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 6064129
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 135923
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 638596
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3366
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 39
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 455994980
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 506883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 544388
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 3808397
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 13
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 16236463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
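
The dataset names in this catalog follow a visible pattern: `huggingface:oscar/unshuffled_`, then `original` or `deduplicated`, then a language code. A minimal sketch of a helper that builds such names is shown below. The function names `oscar_tfds_name` and `load_oscar` are hypothetical, and the loader assumes that `tensorflow_datasets` is installed and can resolve `huggingface:` names in your environment.

```python
def oscar_tfds_name(language: str, deduplicated: bool = False) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_id'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR configuration (hypothetical convenience wrapper)."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


# Example name only; nothing is downloaded by this line.
name = oscar_tfds_name("id")
```

Calling `load_oscar("id")` would then be roughly equivalent to the `tfds.load` command above, restricted to the 'train' split.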


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 625673
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
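
Each record in these datasets carries two features, an integer `id` and a string `text`. The sketch below shows one way to walk over such records as plain Python values. The helper name `iter_id_text` and the sample records are hypothetical, and the assumption that strings may arrive as UTF-8 bytes reflects how records typically look after conversion to NumPy.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_id_text(examples: Iterable[Dict[str, Any]]) -> Iterator[Tuple[int, str]]:
    """Yield (id, text) pairs from records with 'id' and 'text' fields."""
    for example in examples:
        text = example["text"]
        # Strings may arrive as UTF-8 bytes; decode them to str if so.
        if isinstance(text, bytes):
            text = text.decode("utf-8")
        yield int(example["id"]), text


# Stand-in records with made-up values, so the sketch runs without any download.
sample = [{"id": 0, "text": b"first document"}, {"id": 1, "text": "second document"}]
pairs = list(iter_id_text(sample))
```

With a dataset loaded for a single split, passing `tfds.as_numpy(ds)` to `iter_id_text` should work the same way, assuming the records expose the two fields listed above.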


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1445
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 350363
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1549
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 34807
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 52910
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 123
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 437871
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 757
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 232329
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 73
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 34682142
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 59463
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 35440972
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 42114520
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 161836003
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 44280
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 1746604
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 805
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

'train' 475703
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 458206
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
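
The load commands on this page all follow the same naming pattern, so a small helper can build the name from a language code. The following is a minimal sketch, not part of TFDS: the function names `oscar_tfds_name` and `load_oscar` are hypothetical, and the sketch assumes the `huggingface:oscar/<variant>_<language>` pattern seen in the commands on this page holds for the configuration you need.

```python
def oscar_tfds_name(language: str, deduplicated: bool = False) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_tl'.

    Hypothetical helper: it only mirrors the naming pattern used by the
    load commands on this page and does not check that the config exists.
    """
    variant = "unshuffled_deduplicated" if deduplicated else "unshuffled_original"
    return f"huggingface:oscar/{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR configuration through TFDS (hypothetical wrapper).

    The data is fetched on first use, so this needs network access
    and the tensorflow_datasets package.
    """
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)
```

For example, `load_oscar("tl")` would issue the same `tfds.load` call as the command shown for the Tagalog configuration, restricted to the `train` split.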


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contain material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 22255
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contain material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contain material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 9760
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contain material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

'train' 59364
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
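
Every configuration on this page lists the same two features, an `int64` `id` and a `string` `text`. As a rough illustration of that schema, the sketch below checks a plain Python record against it. The function name `matches_oscar_features` is hypothetical, and the sketch assumes records have already been converted to ordinary Python values; examples yielded by TFDS may arrive as tensors or bytes and would need converting first.

```python
# Bounds of a signed 64-bit integer, matching the "int64" dtype listed above.
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if record has exactly an int64-range 'id' and a str 'text'.

    Hypothetical helper for plain Python dicts; it is not a TFDS API.
    """
    if set(record) != {"id", "text"}:
        return False
    record_id = record["id"]
    id_ok = (
        isinstance(record_id, int)
        and not isinstance(record_id, bool)  # bool is a subclass of int
        and INT64_MIN <= record_id <= INT64_MAX
    )
    return id_ok and isinstance(record["text"], str)
```

A check like this can be useful as a quick sanity test on a handful of examples before running a longer preprocessing job.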