
References:


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
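
As a minimal sketch, the load command above can be wrapped in a small helper. It assumes tensorflow_datasets is installed and imported under its usual alias tfds; the names oscar_tfds_name and load_oscar are hypothetical and not part of TFDS, and the alphabetic check only reflects the language codes that appear on this page.

```python
# Hypothetical helpers, not part of TFDS: they only assemble the name
# that the command above passes to tfds.load.
OSCAR_PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code: str) -> str:
    """Return the TFDS name for one OSCAR language configuration."""
    code = lang_code.strip().lower()
    # Assumption: configuration codes are purely alphabetic (af, als, arz, ...).
    if not code.isalpha():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return OSCAR_PREFIX + code


def load_oscar(lang_code: str, split: str = "train"):
    """Load one configuration; needs tensorflow_datasets and network access."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split=split)


print(oscar_tfds_name("af"))
```

Calling load_oscar("af") would then be equivalent to the command above with split="train".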
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    130640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

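The feature listing above describes two fields, an int64 id and a string text. The following sketch shows one way a record could be checked against that listing. It assumes records have already been converted to plain Python values; the mapping PY_TYPES and the function matches_features are hypothetical names introduced for illustration.

```python
# Feature spec copied from the listing above, written as a Python dict.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from the dtype strings to plain Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict) -> bool:
    """Check that a record has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


sample = {"id": 0, "text": "placeholder text"}  # made-up record, not real data
print(matches_features(sample))
```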

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    4518
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

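Each configuration listed here provides only a 'train' split, so a held-out portion has to be carved out by the user. The sketch below builds split strings in the percentage slicing syntax of the TFDS split API. Whether slicing behaves the same for this community dataset is an assumption, and holdout_splits is a hypothetical helper name.

```python
from typing import Tuple


def holdout_splits(train_percent: int = 90) -> Tuple[str, str]:
    """Return two TFDS slicing strings that partition the 'train' split."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    return f"train[:{train_percent}%]", f"train[{train_percent}%:]"


train_split, heldout_split = holdout_splits(90)
print(train_split, heldout_split)

# The strings would be passed as the split argument, for example:
#   tfds.load('huggingface:oscar/unshuffled_deduplicated_als', split=train_split)
```

For very small configurations, a percentage slice may leave one side empty, so checking the example count first is advisable.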

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    79928
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    2025
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    5343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    27050
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    43102
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    9212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    9985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    307405
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    15762
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    26145
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    626796
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    98225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split      Examples
'train'    37
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data were extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1114481
  • Caractéristiques :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
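
The load commands on this page differ only in the trailing language code. A minimal sketch of building that name programmatically follows. The helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical and not part of TFDS; the sketch assumes `tensorflow_datasets` is installed with Hugging Face dataset support and that network access is available when `load_oscar_train` is actually called.

```python
# Prefix shared by every configuration name listed on this page.
OSCAR_PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS dataset name for one OSCAR language code (hypothetical helper)."""
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("lang_code must be a non-empty language code such as 'bn'")
    return OSCAR_PREFIX + code


def load_oscar_train(lang_code: str):
    """Load the 'train' split for one language (hypothetical helper).

    Assumption: tensorflow_datasets is installed and the data can be downloaded.
    The import is done lazily so oscar_tfds_name() works without TFDS installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split="train")
```

Calling `oscar_tfds_name("bn")` produces the same string that appears in the Bengali load command; `load_oscar_train("bn")` would then request only the `'train'` split, which is the only split listed for these configurations.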


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 702

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
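
Every configuration on this page lists the same two features: an `int64` field `id` and a `string` field `text`. The sketch below shows one way to check plain Python records against that shape, for example after decoding examples into Python values. The names `FEATURES` and `matches_features` are hypothetical, and the check is an illustration under those assumptions rather than how TFDS or the Hugging Face library validates data.

```python
# Expected Python type for each feature listed on this page.
FEATURES = {"id": int, "text": str}

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def matches_features(record: dict) -> bool:
    """Return True if record has exactly the fields 'id' (int64 range) and 'text' (str)."""
    if set(record) != set(FEATURES):
        return False
    for name, expected_type in FEATURES.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
    # The id must also fit into a signed 64-bit integer.
    return INT64_MIN <= record["id"] <= INT64_MAX
```

A record such as `{"id": 0, "text": "abc"}` passes, while a record with a missing field, a string `id`, or an `id` outside the signed 64-bit range is rejected.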


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2984

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 10130

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 80

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1172041

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3398679

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1770

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2458067

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68210

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9006977

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 360

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 82

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14724

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4771098

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_dv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    17024

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
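
Every configuration in this list follows the same naming pattern, so the load call can be parameterized by language code. The sketch below shows one way to do that. The helper names `oscar_config_name` and `load_oscar` are illustrative and not part of TFDS, and `load_oscar` assumes that `tensorflow_datasets` and the Hugging Face `datasets` package are installed.

```python
def oscar_config_name(lang_code: str) -> str:
    """Return the TFDS name for one deduplicated OSCAR language config.

    Hypothetical helper: it only formats the string shown in the load
    commands of this catalog and does not check that the config exists.
    """
    if not (lang_code.isascii() and lang_code.isalpha() and lang_code.islower()):
        raise ValueError(f"expected a lowercase language code, got {lang_code!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one split; the import is deferred so the name helper works alone."""
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load(oscar_config_name(lang_code), split=split)


if __name__ == "__main__":
    # Print the names for a few of the language codes listed in this catalog.
    for code in ("dv", "eo", "fa"):
        print(oscar_config_name(code))
```

Keeping the name construction separate from the load call makes it possible to check the string without downloading anything.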

unshuffled_deduplicated_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    84752

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
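
The feature description above says that every example carries an integer `id` and a string `text`. A minimal sketch of checking a plain Python record against that description follows. `FEATURES` copies the listing, while `PY_TYPES`, `matches_features`, and the sample records are hypothetical and only illustrate the two fields.

```python
import json

# The feature listing from this catalog, written as valid JSON.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Python types accepted for each dtype named in the listing (an assumption
# made for this sketch; TFDS itself yields tensors, not plain Python values).
PY_TYPES = {"int64": int, "string": str}


def matches_features(example: dict) -> bool:
    """Return True if `example` has exactly the listed fields with matching types."""
    if set(example) != set(FEATURES):
        return False
    return all(
        isinstance(example[name], PY_TYPES[spec["dtype"]])
        for name, spec in FEATURES.items()
    )


if __name__ == "__main__":
    print(matches_features({"id": 0, "text": "Saluton, mondo!"}))    # True
    print(matches_features({"id": "0", "text": "Saluton, mondo!"}))  # False
```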

unshuffled_deduplicated_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    8203495

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    20661

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    68

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    12308039

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    1909387

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    6582908

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    11

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    59448891

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    3883

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    169834

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    3084

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    529

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    617

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    617

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    108346

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

    Split      Examples
    'train'    29054

  • Features:

    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

unshuffled_deduplicated_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  18808

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

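As a rough sketch of how the load command above might be wrapped for reuse, the helper below builds the TFDS name for a given OSCAR language code and loads one split. The function names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS. The import is deferred so the name helper works without `tensorflow_datasets` installed, and a real call to `load_oscar` still downloads the data on first use.

```python
def oscar_config_name(lang: str) -> str:
    """Build the TFDS name used on this page for a language code such as 'la'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical convenience wrapper)."""
    # Deferred import: tensorflow_datasets is only needed when data is loaded.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang), split=split)


if __name__ == "__main__":
    # Only builds the name; nothing is downloaded here.
    print(oscar_config_name("la"))
```

Keeping the name construction separate from the loading step makes it easy to check the string against the command listed for each config before triggering a download.
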
unshuffled_deduplicated_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  1374

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

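The feature listing above describes two scalar fields, an `int64` `id` and a `string` `text`. The sketch below parses that listing and checks a record against it. It assumes the record has already been decoded to plain Python values, and both the `DTYPE_TO_PYTHON` mapping and the `matches_features` helper are illustrative names rather than anything TFDS provides.

```python
import json

# Feature description as listed above, with the enclosing braces restored.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtype names to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "exemplum"}, features))
```

A check like this can catch a mismatched config early, for instance if a record is missing `text` or carries the `id` as a string.
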
unshuffled_deduplicated_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  843195

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  166

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  212556

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  7

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  58

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  2126

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  6485

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  1

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

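Several of the configs listed so far are very small (the one above has a single `train` example), which matters when picking a config for a quick smoke test or when deciding whether a corpus is large enough to be useful. The sketch below copies the `train` counts from the split tables in this section into a dictionary and filters them by size. The `configs_below` helper is a hypothetical name used only for illustration.

```python
# 'train' example counts copied from the split tables in this section.
TRAIN_EXAMPLES = {
    "la": 18808,
    "lmo": 1374,
    "lv": 843195,
    "min": 166,
    "mr": 212556,
    "mwl": 7,
    "nah": 58,
    "new": 2126,
    "oc": 6485,
    "pam": 1,
}


def configs_below(limit: int) -> list:
    """Return the language codes whose 'train' split has fewer than `limit` examples."""
    return sorted(lang for lang, count in TRAIN_EXAMPLES.items() if count < limit)


if __name__ == "__main__":
    for lang in configs_below(100):
        print(lang, TRAIN_EXAMPLES[lang])
```
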

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  67921

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  28522082

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

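With 28,522,082 `train` examples in the config above, it is usually more practical to look at the first few records than to iterate over the whole split. The sketch below shows one way to do that with a generic iterable of records. The `preview` helper and the stand-in records are illustrative only. With a loaded dataset, the iterable could be something like `tfds.as_numpy(ds)`, which is an assumption about how the data would be consumed rather than something this page specifies.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def preview(examples: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most n examples without consuming the rest of the iterable."""
    return list(islice(examples, n))


# Stand-in records with the same two fields as the feature listing above.
fake_examples = ({"id": i, "text": f"documento {i}"} for i in range(10))

for example in preview(fake_examples, n=2):
    print(example["id"], example["text"])
```

Because `islice` stops after `n` items, the remainder of a lazily evaluated source is never read.
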

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  372158

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  5044757

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  17

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3675420

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  68

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  1381

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  72

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   13343
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
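
The load commands in this catalog differ only in the language suffix of the configuration name. A small sketch of that pattern follows, assuming `tensorflow_datasets` is installed. The helper names `oscar_config_name` and `load_oscar` are hypothetical, and `split="train"` is chosen only because `'train'` is the single split listed for these configurations.

```python
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_config_name(lang):
    """Build the TFDS name used in this catalog for a language code such as 'mg'."""
    lang = lang.strip()
    if not lang:
        raise ValueError("language code must not be empty")
    return PREFIX + lang


def load_oscar(lang, split="train"):
    """Load one configuration; tensorflow_datasets is imported only when called."""
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load(oscar_config_name(lang), split=split)


print(oscar_config_name("mg"))
```

Calling `load_oscar("mg")` would then be equivalent to the `tfds.load` command shown above with the split made explicit.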


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   453904
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   183443
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   5
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   8714
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   109118
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   2559
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   2859
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   411
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   7121
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   2820821
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   17610
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   42
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   645747
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   833101
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   4694
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   24
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   15074
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   677
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split     Examples
'train'   2418
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11014487
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56259
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62398034
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
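
With tens of millions of examples in the 'train' split, it can be convenient to look at a small prefix before iterating over everything. The sketch below relies on the TFDS split slicing syntax (for example `train[:1000]`). The helper names `preview_split` and `load_preview` are hypothetical, and it is an assumption that slicing behaves the same way for `huggingface:` community datasets. Slicing limits what is read, but the dataset may still need to be downloaded and prepared in full first.

```python
def preview_split(n_examples: int, split: str = "train") -> str:
    """Return a TFDS split string selecting the first n_examples of a split."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_preview(name: str, n_examples: int = 1000):
    """Load only the first n_examples; requires tensorflow_datasets and network access."""
    # Imported lazily so preview_split can be used without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load(name, split=preview_split(n_examples))
```

A call such as `load_preview('huggingface:oscar/unshuffled_deduplicated_de', 1000)` would then request only the first 1000 training examples.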


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11596446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6521169
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7782375
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9897709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 49
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7324
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
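
Every config on this page exposes the same two features, an `int64` `id` and a `string` `text`. The following is a minimal sketch of turning one record into plain Python values. It assumes records are obtained through `tfds.as_numpy`, where string features typically arrive as UTF-8 encoded `bytes`. The helper name `decode_record` is hypothetical.

```python
def decode_record(record: dict) -> dict:
    """Convert one OSCAR record into a plain dict with an int id and a str text."""
    raw_text = record["text"]
    # String features usually arrive as bytes when read via tfds.as_numpy.
    text = raw_text.decode("utf-8") if isinstance(raw_text, bytes) else str(raw_text)
    return {"id": int(record["id"]), "text": text}
```

With a loaded dataset `ds`, this could be applied as `decode_record(record)` for each `record` in `tfds.as_numpy(ds['train'])`.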


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 158113
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 912330
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
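
The load commands in this catalog all follow the pattern `huggingface:oscar/unshuffled_<variant>_<lang>`, where the variant is `original` or `deduplicated`. The sketch below builds such a name and wraps `tfds.load`. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the sketch does not check that a given language and variant combination actually exists in the catalog.

```python
def oscar_tfds_name(lang, deduplicated=False):
    """Build the TFDS name for an OSCAR config, following the pattern used above."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang, deduplicated=False, split="train"):
    """Load one OSCAR config through TFDS (downloads the data on first use)."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)


print(oscar_tfds_name("et"))
print(oscar_tfds_name("zh", deduplicated=True))
```

Calling `load_oscar("et")` would then be equivalent to the `tfds.load` command shown for that entry, restricted to the `'train'` split.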


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13704702

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

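The config names in this catalog follow a visible pattern: `unshuffled_original_<code>` or `unshuffled_deduplicated_<code>`, prefixed with `huggingface:oscar/`. As a minimal sketch, the hypothetical helper `oscar_tfds_name` below (it is not part of TFDS) builds the string that is passed to `tfds.load`. It assumes the pattern holds for every language code listed on this page.

```python
def oscar_tfds_name(lang_code: str, deduplicated: bool = False) -> str:
    """Build the TFDS name of an OSCAR config (hypothetical helper).

    Assumes the naming pattern seen in this catalog:
    huggingface:oscar/unshuffled_<original|deduplicated>_<lang_code>
    """
    if not lang_code:
        raise ValueError("lang_code must be a non-empty string")
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang_code.lower()}"


if __name__ == "__main__":
    print(oscar_tfds_name("fa"))
    print(oscar_tfds_name("ja", deduplicated=True))
```

The returned string can then be used as the first argument of `tfds.load`, exactly as in the commands shown for each config.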

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33053

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 106

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3264660

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

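Every config on this page declares the same two features: an `int64` `id` and a `string` `text`. The sketch below checks a record against that schema. It assumes the record has already been converted to plain Python values (an `int` and a `str`); `matches_oscar_features` is a hypothetical helper, not a TFDS function.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(example: dict) -> bool:
    """Return True if `example` has exactly an int64-range `id` and a str `text`."""
    if set(example) != {"id", "text"}:
        return False
    ident, text = example["id"], example["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ident, bool) or not isinstance(ident, int):
        return False
    return INT64_MIN <= ident <= INT64_MAX and isinstance(text, str)


if __name__ == "__main__":
    print(matches_oscar_features({"id": 0, "text": "नमस्ते"}))
    print(matches_oscar_features({"id": "0", "text": "wrong id type"}))
```

Records read directly from a TensorFlow pipeline typically arrive as tensors or byte strings, so a decoding step would be needed before a check like this applies.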

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11197780

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 101

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39496439

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

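The `train` splits listed on this page differ in size by several orders of magnitude (from 34 to 39,496,439 examples in this section). One way to keep experiments manageable is to pass a sliced split such as `train[:100000]` to the `split` argument of `tfds.load`. The sketch below only builds that split string; `capped_split` is a hypothetical helper, and whether slicing avoids preparing the full config for `huggingface:` datasets is not confirmed here.

```python
def capped_split(num_examples: int, cap: int, split: str = "train") -> str:
    """Return a TFDS split string limited to at most `cap` examples.

    `num_examples` is the split size as listed in the catalog.
    """
    if cap <= 0:
        raise ValueError("cap must be a positive integer")
    if num_examples <= cap:
        # Small configs can be loaded in full.
        return split
    # TFDS slicing syntax: take the first `cap` examples of the split.
    return f"{split}[:{cap}]"


if __name__ == "__main__":
    print(capped_split(39496439, 100_000))  # large config, sliced
    print(capped_split(101, 100_000))       # small config, loaded in full
```

The resulting string would be used as `tfds.load(name, split=capped_split(...))`, with `name` being one of the config names shown in this catalog.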

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 338073

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1377

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 86561

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 118

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1737411

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2515

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 197878

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16383

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 917

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 219334

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3229940

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 87235

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3463

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  34

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
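
The load commands in this catalog all follow one naming pattern, `huggingface:oscar/<variant>_<language code>`. The following is a minimal sketch, assuming that pattern holds for every config. The helper names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS, and `load_oscar` further assumes that `tensorflow_datasets` is installed and that network access is available.

```python
def oscar_config_name(variant: str, lang: str) -> str:
    """Build the TFDS name of an OSCAR config (hypothetical helper).

    variant: e.g. "unshuffled_deduplicated" or "unshuffled_original"
    lang:    language code as used in this catalog, e.g. "rm"
    """
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(variant: str, lang: str, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical helper).

    The import is done lazily so that the name helper above can be used
    without tensorflow_datasets being installed.
    """
    import tensorflow_datasets as tfds  # assumption: package is installed

    return tfds.load(oscar_config_name(variant, lang), split=split)
```

For example, `oscar_config_name("unshuffled_deduplicated", "rm")` yields the name used in the load command of the entry above.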


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  8555

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
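
Every config in this catalog lists the same two features, an `int64` field `id` and a `string` field `text`. The following is a minimal sketch of checking a plain Python record against that schema. The function name `matches_oscar_schema` is hypothetical, and the check assumes records have already been converted to ordinary Python `dict`, `int`, and `str` values.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_schema(record: dict) -> bool:
    """Return True if record has exactly an int64-range 'id' and a str 'text'."""
    if set(record) != {"id", "text"}:
        return False
    rid, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rid, bool) or not isinstance(rid, int):
        return False
    return INT64_MIN <= rid <= INT64_MAX and isinstance(text, str)
```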


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  120684

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  461598

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  24803

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3749826

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  82738

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  428674

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3317

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  36

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  7

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  83663

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  14985

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  15446

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  586031

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  26795

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  42

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  56248

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  157698

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  65

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

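The config names in this catalog follow a visible pattern (`unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>`), so the load command can be sketched as a small helper. This is only an illustration: `oscar_config_name` and `load_oscar_train` are hypothetical names introduced here, and the sketch assumes `tensorflow_datasets` is installed with support for Hugging Face community datasets.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR config from its language code."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar_train(lang: str, deduplicated: bool = False):
    """Load the 'train' split of one OSCAR config (hypothetical helper).

    The import is deferred so the name helper above works without TFDS.
    """
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load(oscar_config_name(lang, deduplicated), split="train")


print(oscar_config_name("dsb"))  # huggingface:oscar/unshuffled_original_dsb
```

Passing `split="train"` returns a single dataset rather than a dictionary of splits, which is convenient here because each config lists only a 'train' split.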

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 96742378
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

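With tens of millions of examples in a split like this one, it can help to inspect only the first few records rather than materialize everything. The sketch below works on any iterable of records with the `id` and `text` fields listed above; the `preview` helper and the sample records are hypothetical, made up for illustration, and not taken from the corpus.

```python
from itertools import islice


def preview(examples, n=3, max_chars=80):
    """Return up to n (id, truncated text) pairs from an iterable of records."""
    rows = []
    for example in islice(examples, n):
        text = example["text"]
        if isinstance(text, bytes):  # string features may arrive as bytes
            text = text.decode("utf-8", errors="replace")
        rows.append((int(example["id"]), text[:max_chars]))
    return rows


# Stand-in records with the same two fields as the schema (made-up values).
sample = [
    {"id": 0, "text": b"Bonjour le monde"},
    {"id": 1, "text": "Un second document"},
]
print(preview(sample, n=2))
```

With a real dataset, one option is to pass `tfds.as_numpy(ds)` as the iterable, since `islice` stops after `n` records without reading the rest.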

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5799
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 240691
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 7959
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1040
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 694
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 832
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 159363
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 46535
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 94588
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1401
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1593820
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 220
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 326804
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 61
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4696
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 10709
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  98216
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
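
The feature listing above describes each record as an int64 `id` plus a string `text`. As a minimal sketch (not part of TFDS or Hugging Face Datasets), a plain-Python record could be checked against that description as follows. The `FEATURES` dictionary mirrors the listing; the helper name `validate_record` and the explicit int64 range check are assumptions made for illustration.

```python
# Hypothetical helper: checks a plain-Python record against the
# feature description listed above ("id": int64, "text": string).
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def validate_record(record: dict) -> bool:
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
    return True
```

A record such as `{"id": 0, "text": "..."}` passes, while a record with a missing field or a string `id` is rejected.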


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9387265
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
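
The dataset names on this page follow one pattern: `huggingface:oscar/unshuffled_<variant>_<language code>`, where the variant is `original` or `deduplicated`. The sketch below builds such a name from its parts. The function `oscar_tfds_name` is hypothetical, and it does not check that a given configuration actually exists.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build a TFDS name for an OSCAR configuration (illustrative helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


# Loading the result downloads data and requires tensorflow_datasets:
#   import tensorflow_datasets as tfds
#   ds = tfds.load(oscar_tfds_name("ro"))
```

Calling `oscar_tfds_name("ro")` yields the name used in the load command for this configuration.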


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  21
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5492194
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1013619
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1263280
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6456
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  34
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  27537
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1001
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3783
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  46981781
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
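
With 46,981,781 training examples, this configuration is large enough that loading only a slice may be more practical for quick experiments. TFDS accepts slice expressions such as `train[:1000]` in the `split` argument of `tfds.load`. The sketch below builds such an expression from the example count in the table. The helper `capped_split` is hypothetical, and whether slicing works unchanged for `huggingface:` datasets is an assumption to verify.

```python
def capped_split(split: str, num_examples: int, cap: int) -> str:
    """Return a TFDS split expression limited to at most `cap` examples."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    if num_examples <= cap:
        # The whole split already fits under the cap.
        return split
    return f"{split}[:{cap}]"


# Intended use (downloads data; requires tensorflow_datasets):
#   import tensorflow_datasets as tfds
#   ds = tfds.load('huggingface:oscar/unshuffled_original_it',
#                  split=capped_split('train', 46981781, 10000))
```

For small configurations the helper returns the split name unchanged, so the same call can be used for every language on this page.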


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  563916
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7345075
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  203
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1485
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  88
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  17957
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  603937
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  534016
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

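The load commands on this page appear to follow a single naming pattern, `huggingface:oscar/unshuffled_<variant>_<lang>`. The sketch below builds such a name from its parts. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the pattern is inferred only from the commands listed here, so treat it as an assumption rather than a documented rule.

```python
# Variants observed in the load commands on this page (an assumption,
# not an exhaustive list taken from any official documentation).
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang: str, variant: str = "original") -> str:
    """Build a name such as 'huggingface:oscar/unshuffled_original_ms'."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, variant: str = "original", split: str = "train"):
    """Load one configuration; requires tensorflow_datasets to be installed."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, variant), split=split)


print(oscar_tfds_name("ms"))
```

Calling `load_oscar("ms")` would then be equivalent to the `tfds.load` command shown for this configuration, with the `'train'` split selected.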

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  6
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

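The feature specification above describes two fields per example: an integer `id` and a string `text`. As a minimal sketch, assuming records have already been converted to plain Python dictionaries, the check below compares a record against that specification. The `PY_TYPES` mapping and the `matches_features` helper are hypothetical names introduced for illustration.

```python
import json

# The feature specification as listed on this page.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the dtype names in the spec to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields and types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "hello"}, features))
print(matches_features({"id": "0", "text": "hello"}, features))
```

Examples read directly from TFDS may arrive as tensors or byte strings rather than Python `int` and `str`, so a conversion step would be needed before a check like this applies.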

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  18174
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  185884
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  5213
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3225
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  452
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  14291
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  36700
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  156
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  17395625
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  89002
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  18535253
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  12973467
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  14898250
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  214
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  214
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  60137667
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  304230423
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

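With about 304 million examples in the `'train'` split of this configuration, reading only a fraction may be more practical than iterating over everything. The sketch below builds a TFDS split-slicing string such as `train[:1%]` and passes it to `tfds.load`. The helper names `train_slice` and `load_sample` are hypothetical, and whether slicing avoids preparing the full dataset for `huggingface:` community datasets is not confirmed here, so treat that as an open assumption.

```python
def train_slice(percent: int) -> str:
    """Return a split string selecting the first `percent` of 'train'."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"train[:{percent}%]"


def load_sample(name: str, percent: int = 1):
    """Load a leading slice of the 'train' split; requires tensorflow_datasets."""
    # Imported lazily so the slicing helper works without TFDS installed.
    import tensorflow_datasets as tfds

    # Slicing limits what is read back; the dataset may still need to be
    # downloaded and prepared in full first (assumption, not verified).
    return tfds.load(name, split=train_slice(percent))


print(train_slice(1))
```

For example, `load_sample('huggingface:oscar/unshuffled_deduplicated_en')` would request the first 1% of the split.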

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  256513
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
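
Every config on this page is loaded with the same `tfds.load` call and differs only in the trailing language code. A minimal sketch of that pattern follows; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical and not part of TFDS, and the sketch assumes `tensorflow_datasets` is installed with Hugging Face community dataset support.

```python
# Common prefix shared by the deduplicated OSCAR configs listed on this page.
OSCAR_PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS dataset name for one language code (hypothetical helper)."""
    code = lang.strip().lower()
    if not code:
        raise ValueError("language code must not be empty")
    return OSCAR_PREFIX + code


def load_oscar(lang: str, split: str = "train"):
    """Load one config; tensorflow_datasets is imported lazily (assumed installed)."""
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang), split=split)
```

For example, `oscar_tfds_name("frr")` yields the name used in the command above, and `load_oscar("frr")` would request its single `'train'` split.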


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
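
The feature listing describes two scalar fields per record: an `int64` `id` and a `string` `text`. As a rough sketch, the check below compares an already-decoded Python record against that listing. The function `matches_schema` and the dtype-to-Python mapping are illustrative assumptions; records returned by TFDS are tensors or bytes and would need decoding first.

```python
import json

# Feature listing from this page, written out as complete JSON.
FEATURES_JSON = """
{
    "id":   {"dtype": "int64",  "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""
FEATURES = json.loads(FEATURES_JSON)

# Assumed mapping from the listed dtypes to plain Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_schema(record: dict, features: dict = FEATURES) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )
```

A record such as `{"id": 0, "text": "..."}` passes, while one with a missing field or a string `id` does not.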


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 389515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1163
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 251064
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 924
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21735
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32652
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 25
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299457
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 669
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 136639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 55
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20812149
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44230
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20682611
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26920397
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

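The load commands in this catalog follow a visible naming pattern: `huggingface:oscar/unshuffled_`, then `deduplicated` or `original`, then a language code. The following is a minimal sketch, assuming that pattern holds for every config. The helper names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS. The `tfds.load` call is the one shown in the entries, with `split='train'` passed on the assumption that only the listed split is wanted.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a TFDS dataset name following the pattern used in this catalog."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = True, split: str = "train"):
    """Load one OSCAR config. Requires tensorflow_datasets to be installed."""
    # Imported lazily so that oscar_config_name stays usable without TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang, deduplicated), split=split)


# Building names does not download anything.
print(oscar_config_name("pt"))
print(oscar_config_name("af", deduplicated=False))
```

Calling `load_oscar("pt")` would trigger the actual download and preparation, which can be large for configs with tens of millions of examples.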

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 115954598
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

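Every config in this catalog lists the same two features: an `int64` field `id` and a `string` field `text`. As a rough sketch, a record that has already been converted to plain Python values could be checked against that schema as shown below. The function `matches_schema` is a hypothetical helper, and the assumption that records arrive as plain `int` and `str` values is ours: TFDS itself yields tensors, so a conversion step would be needed first.

```python
# The feature description as listed in the catalog entries.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_schema(example: dict) -> bool:
    """Return True if a plain-Python record has exactly the listed fields and types."""
    if set(example) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = example[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
    return True


print(matches_schema({"id": 0, "text": "Привет"}))
print(matches_schema({"id": "0", "text": "Привет"}))
```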

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33925
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 886223
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 511
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 312644
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 294132
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15503
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9161
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32919
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 201117
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

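The `unshuffled_original_af` entry above lists 201117 training examples, while the `unshuffled_deduplicated_af` entry earlier in this catalog lists 130640. As a small illustration, the share of examples kept after deduplication can be computed from those two counts. The dictionary `TRAIN_EXAMPLES` and the function `retained_fraction` are hypothetical names used only for this sketch, and the figure covers example counts only, not bytes or tokens.

```python
# Example counts copied from the two Afrikaans entries in this catalog.
TRAIN_EXAMPLES = {
    "unshuffled_original_af": 201117,
    "unshuffled_deduplicated_af": 130640,
}


def retained_fraction(original: int, deduplicated: int) -> float:
    """Fraction of examples that remain in the deduplicated config."""
    if original <= 0:
        raise ValueError("original count must be positive")
    return deduplicated / original


fraction = retained_fraction(
    TRAIN_EXAMPLES["unshuffled_original_af"],
    TRAIN_EXAMPLES["unshuffled_deduplicated_af"],
)
removed = (
    TRAIN_EXAMPLES["unshuffled_original_af"]
    - TRAIN_EXAMPLES["unshuffled_deduplicated_af"]
)
print(f"retained: {fraction:.1%}, removed examples: {removed}")
```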

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16365602
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 456
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 336
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 37085
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21001388
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 104913504
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; see http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 10425596
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

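The load commands in this listing share one naming pattern: `huggingface:oscar/unshuffled_<variant>_<language code>`, where the variant is `original` or `deduplicated`. The sketch below builds such names. `oscar_config_name` is a hypothetical helper for illustration, not part of TFDS, and it assumes the pattern holds for the configuration you want. The `tfds.load` call is left as a comment because it downloads data.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_el'.

    Hypothetical helper: it only formats the string and does not check
    that the configuration actually exists.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_config_name("el")

# Loading requires tensorflow_datasets and network access:
# import tensorflow_datasets as tfds
# ds = tfds.load(name, split="train")
```

Keeping the name construction in one place can help when iterating over several language codes from this listing.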

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 88199221
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

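Every configuration in this listing declares the same two features: an `int64` `id` and a `string` `text`. As a rough sketch, the check below tests whether a plain Python record fits that schema. `matches_features` is a hypothetical helper, and it assumes records have already been converted to Python values (TFDS itself yields tensors, which this check does not handle).

```python
# Feature spec as listed above; None stands in for JSON null.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(example: dict) -> bool:
    """Return True if `example` has exactly the listed fields with compatible types."""
    if set(example) != set(FEATURES):
        return False
    doc_id, text = example["id"], example["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = (
        isinstance(doc_id, int)
        and not isinstance(doc_id, bool)
        and INT64_MIN <= doc_id <= INT64_MAX
    )
    return id_ok and isinstance(text, str)
```

A check like this is mainly useful in tests or when exporting records to another format.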

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8557453
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 83223
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 582219
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 659430
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2638
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 62721527
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

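The `train` split sizes in this listing vary widely, from 640 examples for `gom` to 62721527 for `ja`. For quick experiments it may be convenient to cap the number of examples requested. The sketch below builds a split string using the TFDS slicing syntax `train[:N]`. `capped_train_split` is a hypothetical helper, and whether slicing reduces download or preparation cost for `huggingface:` datasets is an assumption not confirmed by this page.

```python
def capped_train_split(num_examples: int, cap: int) -> str:
    """Return 'train' for small splits, or a sliced split such as 'train[:10000]'."""
    if cap <= 0:
        raise ValueError("cap must be a positive integer")
    return "train" if num_examples <= cap else f"train[:{cap}]"


split = capped_train_split(62721527, 10_000)

# Loading requires tensorflow_datasets and network access:
# import tensorflow_datasets as tfds
# ds = tfds.load('huggingface:oscar/unshuffled_original_ja', split=split)
```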

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 524591
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1581
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 146993
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 137
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2977757
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 395605
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 26598
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1055
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 299938
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5546211
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

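The configuration names in this catalog follow a visible pattern: `huggingface:oscar/unshuffled_original_<language code>`, with `deduplicated` in place of `original` for the deduplicated variants. A minimal sketch of building such a name and passing it to `tfds.load` follows. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the sketch assumes `tensorflow_datasets` is installed and the config can be downloaded.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Hypothetical helper: build the TFDS name for an OSCAR config,
    following the naming pattern used in the catalog entries."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Hypothetical wrapper around tfds.load for one OSCAR config."""
    # Imported lazily so that building names does not require TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)


print(oscar_tfds_name("no"))
```

Calling `load_oscar("no")` would then correspond to the command shown in the entry above, with the `'train'` split selected explicitly.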

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  127467
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4599
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  41
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  22301
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  203082
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  672077
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  41986
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6064129
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  135923
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  638596
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3366
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  39
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  455994980
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  506883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  7
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  544388
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3808397
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  13
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16236463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
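
The configuration names in this catalog appear to follow the pattern `unshuffled_<variant>_<language code>`, so the load command can be wrapped in a small helper. The sketch below is illustrative only: the helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and it assumes `tensorflow_datasets` is installed with support for the `huggingface:` namespace.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR configuration, e.g. 'id' or 'is'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR configuration; the entries here list only a 'train' split."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)


print(oscar_tfds_name("id"))  # huggingface:oscar/unshuffled_original_id
```

Calling `load_oscar("id")` would then be equivalent to the `tfds.load` command shown for this entry, with the split made explicit.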


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 625673
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
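
Each entry lists the same two features, an `int64` `id` and a `string` `text`. As a rough sketch, a record that has already been converted to plain Python values can be checked against that schema as follows. The function name `matches_features` and the dtype-to-type mapping are assumptions made for illustration; raw TFDS output (tensors or bytes) would need decoding first.

```python
import json

# Feature spec as listed in the catalog entries, with the JSON braces restored.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtypes to plain Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(example: dict, features: dict = None) -> bool:
    """Return True if `example` has exactly the listed fields with matching types."""
    features = features or json.loads(FEATURES_JSON)
    if set(example) != set(features):
        return False
    return all(
        isinstance(example[name], _PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )
```

A check like this is mainly useful as a sanity test after decoding, for example before writing records to another format.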


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1445
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 350363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1549
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34807
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 52910
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 123
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 437871
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 232329
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34682142
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 35440972
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42114520
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 161836003
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1746604
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 805
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
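
The feature listing above describes each example as an integer `id` and a string `text`. As a minimal sketch, assuming examples have already been decoded into plain Python dictionaries, a record can be checked against that layout as follows. The names `FEATURES`, `PY_TYPES` and `validate_record` are hypothetical and are not part of TFDS.

```python
# Feature layout copied from the listing above; the dict name is illustrative.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from the listed dtype strings to Python types.
PY_TYPES = {"int64": int, "string": str}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def validate_record(record: dict) -> bool:
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # An int64 field cannot hold values outside the signed 64-bit range.
        if spec["dtype"] == "int64" and not (INT64_MIN <= value <= INT64_MAX):
            return False
    return True
```

Examples returned by TFDS itself are typically tensors or byte strings rather than Python `int` and `str`, so a check like this would apply only after decoding.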


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 475703
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 458206
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 22255
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 73
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9760
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59364
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
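
The load commands for the `unshuffled_original` configurations above differ only in their language code. As a rough sketch, the name can be built from the code and the `'train'` split requested directly. The helper names (`TRAIN_EXAMPLES`, `tfds_name`, `smallest_config`, `load_train`) are hypothetical, and whether the `huggingface:` namespace still resolves depends on the installed TFDS version.

```python
# Train-split sizes copied from the unshuffled_original listings above.
TRAIN_EXAMPLES = {
    "su": 805,
    "te": 475703,
    "tl": 458206,
    "ug": 22255,
    "vec": 73,
    "war": 9760,
    "yi": 59364,
}


def tfds_name(lang_code: str) -> str:
    """Build the dataset name used in the load commands above."""
    return f"huggingface:oscar/unshuffled_original_{lang_code}"


def smallest_config() -> str:
    """Return the language code with the fewest listed train examples."""
    return min(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)


def load_train(lang_code: str):
    """Load the 'train' split; needs tensorflow_datasets and network access."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(lang_code), split="train")
```

Starting with the smallest listed configuration, for example `load_train(smallest_config())`, keeps a first download short before moving on to the larger ones.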