oscar

Les références:

unshuffled_deduplicated_af

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 130640
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_als

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4518
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_arz

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 79928
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2025
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5343
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 27050
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 43102
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9212
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9985
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 307405
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 15762
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 36
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 26145
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 626796
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 98225
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 37
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1114481
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bs

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 702
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ce

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2984
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 10130
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eml

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 80
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_et

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1172041
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bg

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3398679
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1770
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2458067
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 68210
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement l’œuvre protégée par le droit d’auteur qui aurait été violée.
    • Identifiez clairement le matériel prétendument en infraction et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous répondrons aux demandes légitimes en supprimant les sources concernées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9006977
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_av

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées sous ce régime de licence. Nous ne possédons aucun des textes à partir desquels ces données ont été extraites. Nous accordons une licence pour le packaging lui-même de ces données sous la licence Creative Commons CC0 (« aucun droit réservé ») http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible par la loi, Inria a renoncé à tout droit d'auteur et droits d'auteur ou connexes. droits voisins d'OSCAR Cet ouvrage est publié à partir de : France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne doivent donc pas être reproduites ici, veuillez :

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail auxquelles vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 360
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_Dedupliated_bar

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_bh

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 82
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_br

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 14724
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_cbk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_da

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4771098
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_dv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 17024
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_eo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 84752
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_fa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 8203495
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_fy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 20661
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_gn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 68
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_Dedupliated_cs

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 12308039
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_hi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1909387
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_Dedupliated_hu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6582908
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_ie

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 11
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_fr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 59448891
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_Dedupliated_gd

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3883
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_gu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 169834
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_hsb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3084
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_ia

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 529
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_io

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 617
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_jbo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 617
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_km

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 108346
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_Dedupliated_ku

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 29054
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_la

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 18808
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_lmo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1374
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_lv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 843195
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_Dedupliated_min

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 166
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliqué_mr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 212556
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_mwl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Si vous considérez que nos données contiennent du matériel qui vous appartient et ne devrait donc pas être reproduit ici, s'il vous plaît:

    • Identifiez-vous clairement, avec des données de contact détaillées telles qu'une adresse, un numéro de téléphone ou une adresse e-mail à laquelle vous pouvez être contacté.
    • Identifiez clairement le travail protégé par le droit d'auteur prétendu.
    • Identifiez clairement le matériel qui est censé être infiltré et les informations raisonnablement suffisantes pour nous permettre de localiser le matériel.

    Nous nous conformerons aux demandes légitimes en supprimant les sources affectées de la prochaine version du corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

non taillé_dedupliated_nah

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licence : Ces données sont publiées dans le cadre de ce schéma de licence, nous ne possédons aucun texte à partir duquel ces données ont été extraites. Nous concédons l'emballage réel de ces données sous la licence Creative Commons CC0 ("Pas de droits réservés") http://creativecommons.org/publicdomain/zero/1.0/ Dans la mesure du possible en vertu de la loi, INRIA a renoncé à tous Droits voisins à Oscar Ce travail est publié à partir de: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 58
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2126
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6485
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 67921
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 28522082
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 372158
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5044757
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 17
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3675420
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 68
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1381
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 72
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 13343
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 453904
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 183443
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 8714
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 109118
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2559
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2859
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 411
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7121
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2820821
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 17610
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 42
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 645747
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 833101
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4694
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 24
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 15074
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 677
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2418
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 11014487
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tg

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 56259
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 62398034
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 11596446
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6521169
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7782375
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9897709
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 64
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 49
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_als

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7324
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 158113
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 912330
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1675515
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2143
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4042
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 20281
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 84
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2093621
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_zh

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 41708901
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_an

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2449
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6999
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 42551
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5869686
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6046
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4390754
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 103639
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 56326016
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7664010
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 21018
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 121168
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5326443
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 46493
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 484
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 321484
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 396093
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1578
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 13704702
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 33053
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 106
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3264660
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 11197780
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 101
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 39496439
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 338073
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1377
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 86561
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 118
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1737411
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2515
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 197878
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 16383
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 917
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 219334
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3229940
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 87235
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3463
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 34
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sah

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 8555
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 120684
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 461598
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 24803
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3749826
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 82738
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 428674
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3317
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 36
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_am

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 83663
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 14985
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 15446
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 586031
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 26795
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 42
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 56248
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 157698
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 65
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 96742378
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gd

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5799
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 240691
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7959
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1040
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 694
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 832
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 159363
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 46535
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 94588
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1401
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1593820
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 220
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 326804
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 8
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 61
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4696
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 10709
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 98216
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9387265
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 21
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5492194
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1013619
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1263280
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6456
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 34
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 27537
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1001
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3783
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 46981781
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ka

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 563916
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7345075
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 203
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1485
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 88
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 17957
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 603937
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 534016
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 18174
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 185884
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5213
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3225
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 452
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 14291
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 36700
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 156
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 17395625
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tg

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 89002
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 18535253
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 12973467
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 14898250
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 214
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 214
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 60137667
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_en

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 304230423
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 256513
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 284320
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2375030
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9948521
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 389515
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1163
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 251064
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 924
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 21735
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 32652
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 25
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 299457
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 669
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 136639
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 55
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 20812149
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 44230
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 20682611
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 26920397
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 115954598
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 33925
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 886223
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 511
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 312644
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 294132
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 15503
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 64
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9161
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 32919
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 201117
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ar

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 16365602
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 456
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 336
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 37085
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 21001388
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 104913504
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_el

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 10425596
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 88199221
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 8557453
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 83223
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 640
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 582219
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 659430
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2638
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 62721527
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 524591
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1581
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 146993
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 137
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 2977757
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3212
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 395605
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 26598
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1055
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 299938
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 5546211
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 127467
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 4599
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 41
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 22301
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 203082
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 672077
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 41986
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 6064129
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 135923
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 638596
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3366
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 39
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 11
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 455994980
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eu

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 506883
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 7
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 544388
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 3808397
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 13
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 16236463
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 625673
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1445
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 350363
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1549
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 34807
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 52910
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 123
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 437871
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 757
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 232329
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 73
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 34682142
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 59463
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 35440972
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pt

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 42114520
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 161836003
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 44280
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 1746604
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 805
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 475703
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 458206
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 22255
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 73
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 9760
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Utilisez la commande suivante pour charger cet ensemble de données dans TFDS :

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Divisions :

Diviser Exemples
'train' 59364
  • Caractéristiques :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}