Oscar

References:

unshuffled_deduplicated_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 130640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
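
As a minimal sketch of how the load command and the feature listing above fit together, the snippet below builds the name string passed to `tfds.load` and checks that a decoded record has the two listed fields. The helper names `tfds_name`, `matches_features`, and `load_train` are hypothetical and not part of TFDS. The check assumes plain Python values (for example after converting with `tfds.as_numpy` and decoding the text bytes), not raw tensors.

```python
from typing import Any, Dict

# Feature spec copied from the listing above: field name -> dtype.
FEATURES = {"id": "int64", "text": "string"}


def tfds_name(config: str) -> str:
    """Build the name string passed to tfds.load for an OSCAR config."""
    return f"huggingface:oscar/{config}"


def matches_features(record: Dict[str, Any]) -> bool:
    """Check that a decoded record has exactly the listed fields and types."""
    if set(record) != set(FEATURES):
        return False
    value = record["id"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = (
        isinstance(value, int)
        and not isinstance(value, bool)
        and -2**63 <= value < 2**63
    )
    return id_ok and isinstance(record["text"], str)


def load_train(config: str):
    """Load the 'train' split; needs tensorflow_datasets and network access."""
    # Imported lazily so the helpers above can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split="train")
```

Calling `load_train('unshuffled_deduplicated_af')` would issue the same load as the command shown for this config, restricted to the 'train' split.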

unshuffled_deduplicated_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4518
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 79928
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2025
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 27050
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 43102
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 307405
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 15762
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 26145
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 626796
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 98225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 37
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1114481
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
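Every load command on this page uses the same name pattern, `huggingface:oscar/unshuffled_deduplicated_<language code>`. As a small sketch (the helper `oscar_tfds_name` is hypothetical; it only builds the string and does not check that the configuration actually exists), the name could be derived from a language code:

```python
# Common prefix of the dataset names listed on this page.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code):
    """Build the TFDS name for a deduplicated OSCAR language subset."""
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("language code must not be empty")
    return PREFIX + code


print(oscar_tfds_name("bn"))  # huggingface:oscar/unshuffled_deduplicated_bn
```

With `import tensorflow_datasets as tfds`, the returned string can be passed to `tfds.load`, for example `tfds.load(oscar_tfds_name('bn'))`.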

unshuffled_deduplicated_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2984
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10130
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 80
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1172041
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3398679
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1770
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2458067
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68210
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9006977
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 360
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 82
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14724
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4771098
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Licenza : questi dati vengono rilasciati nell'ambito di questo schema di licenza non possediamo nessuno dei quali sono stati estratti questi dati. Licenza al packaging effettivo di questi dati ai sensi della licenza CCC0 Creative Commons ("Nessun diritti riservato") http://creativecommons.org/publicdomain/zero/1.0/ nella misura possibile ai sensi della legge, Inria ha rinunciato a tutti i copyright e correlati o Diritti vicini a Oscar Questo lavoro è pubblicato da: Francia.

    Se si considera che i nostri dati contengano materiale di proprietà di te e quindi non dovrebbe essere riprodotto qui, per favore:

    • Identifica chiaramente te stesso, con dati di contatto dettagliati come un indirizzo, numero di telefono o indirizzo e -mail in cui è possibile contattare.
    • Identificare chiaramente il lavoro protetto da copyright sostenuto.
    • Identificare chiaramente il materiale che si afferma che si viola e informazioni ragionevolmente sufficienti per consentirci di individuare il materiale.

    Rispetteremo le richieste legittime rimuovendo le fonti interessate dalla prossima versione del corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17024
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
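
Every configuration on this page shares the same two-field feature spec, so a small helper can build the dataset name and sanity-check a record. The following is only a sketch: `config_name`, `matches_spec`, `load_split`, and the `PY_TYPES` mapping are hypothetical names introduced here, and `matches_spec` assumes the record has already been converted to plain Python values (TFDS itself yields tensors).

```python
import json

# Feature spec copied from the listing above.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Hypothetical mapping from the spec's dtype strings to Python types.
PY_TYPES = {"int64": int, "string": str}


def config_name(lang: str) -> str:
    """Build the TFDS name used on this page for a language code."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def matches_spec(record: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in FEATURES.items()
    )


def load_split(lang: str, split: str = "train"):
    """Load one split; requires tensorflow_datasets and downloads the data."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load(config_name(lang), split=split)
```

For example, `config_name("dv")` produces the same name passed to `tfds.load` above, and `load_split("dv")` would request its `'train'` split.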

unshuffled_deduplicated_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 84752
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8203495
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 20661
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 12308039
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1909387
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6582908
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59448891
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 169834
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3084
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 529
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 617
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 617
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 108346
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 29054
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Versione : 1.0.0

  • Divide :

Diviso Esempi
'train' 18808
  • Caratteristiche :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
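
Every load command on this page follows one pattern: the prefix `huggingface:oscar/unshuffled_deduplicated_` followed by a language code. The sketch below builds those names programmatically. The helpers `tfds_name` and `load_config` are hypothetical conveniences, and `load_config` assumes `tensorflow_datasets` is installed and that the dataset is reachable from your environment.

```python
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def tfds_name(lang_code):
    """Build the TFDS dataset name for one language code, e.g. 'la'."""
    return PREFIX + lang_code


def load_config(lang_code, split="train"):
    """Load one configuration; the import is deferred so the name helper
    above works even where tensorflow_datasets is not installed."""
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(lang_code), split=split)


for code in ("la", "lmo", "lv"):
    print(tfds_name(code))
```

Calling `load_config("la")` would then be equivalent to the `tfds.load` command shown for that configuration, restricted to the 'train' split.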

unshuffled_deduplicated_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1374
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 843195
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 166
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 212556
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 58
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2126
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
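
The 'train' split sizes listed so far range from a single example (pam) to several hundred thousand (lv). As a rough sketch, the snippet below records the counts quoted in the sections above and flags configurations that fall under a size threshold. The threshold of 1000 and the helper name `small_configs` are illustrative assumptions, not recommendations from the OSCAR authors.

```python
# 'train' example counts copied from the split tables in this section.
TRAIN_EXAMPLES = {
    "la": 18808,
    "lmo": 1374,
    "lv": 843195,
    "min": 166,
    "mr": 212556,
    "mwl": 7,
    "nah": 58,
    "new": 2126,
    "oc": 6485,
    "pam": 1,
}


def small_configs(counts, min_examples=1000):
    """Return language codes with fewer than min_examples, smallest first."""
    below = [(n, code) for code, n in counts.items() if n < min_examples]
    return [code for n, code in sorted(below)]


print(small_configs(TRAIN_EXAMPLES))
```

With the assumed threshold this lists pam, mwl, nah and min, which are configurations too small to split further in any meaningful way.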

unshuffled_deduplicated_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 67921
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 28522082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 372158
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5044757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3675420
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1381
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 72
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 13343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
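
The configurations on this page are all loaded under a name of the form `huggingface:oscar/unshuffled_deduplicated_<code>`. The sketch below builds that name from a language code. The helper `oscar_tfds_name` is hypothetical, and the `tfds.load` call is left commented out because it assumes TensorFlow Datasets, the Hugging Face `datasets` package, and network access are available:

```python
def oscar_tfds_name(language_code):
    """Build the TFDS name used on this page for a deduplicated OSCAR config."""
    code = language_code.strip()
    if not code:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/unshuffled_deduplicated_{code}"


name = oscar_tfds_name("mg")

# With the dependencies installed, this is the same call as shown above:
# import tensorflow_datasets as tfds
# ds = tfds.load(name)
```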

unshuffled_deduplicated_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 453904
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 183443
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8714
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 109118
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2559
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2859
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 411
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 7121
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2820821
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17610
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 42
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 645747
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 833101
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4694
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 24
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 15074
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 677
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   2418
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   11014487
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
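
With 11,014,487 training examples, this config is large enough that reading only a prefix is often more practical for a first look. TFDS accepts slice expressions such as `train[:1000]` as the `split` argument. The sketch below only builds such a string; the helper name `head_split` is hypothetical, and it makes no claim that slicing reduces what is downloaded or prepared.

```python
def head_split(split: str = "train", n_examples: int = 1000) -> str:
    """Build a TFDS split expression that selects the first `n_examples` examples."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


# Possible usage (not executed here; assumes tensorflow_datasets is installed):
#   import tensorflow_datasets as tfds
#   ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv',
#                  split=head_split("train", 1000))
```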

unshuffled_deduplicated_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   56259
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   62398034
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
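
At 62,398,034 examples, a full pass over this config is expensive, so a quick sanity check usually looks at a bounded sample. The sketch below summarizes the `text` field of the first few records of any iterable of `{"id", "text"}` dicts. The function name `sample_text_stats` is hypothetical, and decoding `bytes` as UTF-8 is an assumption about how a loader may return strings.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Mapping


def sample_text_stats(records: Iterable[Mapping[str, Any]], limit: int = 1000) -> Dict[str, int]:
    """Count examples and characters over at most `limit` records."""
    examples = 0
    total_chars = 0
    max_chars = 0
    for record in islice(records, limit):
        text = record["text"]
        if isinstance(text, bytes):
            # Assumption: byte strings hold UTF-8 text.
            text = text.decode("utf-8", errors="replace")
        examples += 1
        total_chars += len(text)
        max_chars = max(max_chars, len(text))
    return {"examples": examples, "total_chars": total_chars, "max_chars": max_chars}


# Possible usage (not executed here; assumes tensorflow_datasets is installed):
#   import tensorflow_datasets as tfds
#   ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de', split='train')
#   print(sample_text_stats(tfds.as_numpy(ds), limit=1000))
```

Because the function stops after `limit` records, it works equally well on a lazily evaluated dataset and on a small in-memory list.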

unshuffled_deduplicated_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   11596446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   6521169
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   7782375
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   9897709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   49
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
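
Every load command in this listing follows the same pattern: the prefix `huggingface:oscar/`, a variant (`unshuffled_deduplicated` or `unshuffled_original`), an underscore, and a language code. A small sketch of building that name is shown below. The helper `oscar_tfds_name` is hypothetical, and the variant tuple covers only the two variants that appear in this listing.

```python
# Variants that appear in this listing; others may exist (assumption).
VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build the name string passed to tfds.load for one OSCAR config."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"huggingface:oscar/{variant}_{lang}"


# Possible usage (not executed here; assumes tensorflow_datasets is installed):
#   import tensorflow_datasets as tfds
#   ds = tfds.load(oscar_tfds_name("unshuffled_deduplicated", "yo"))
```

Building the name in one place makes it easier to loop over several languages without retyping the prefix.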

unshuffled_original_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   7324
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   158113
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   912330
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
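
Split sizes in this listing range from a single example to tens of millions, so it can help to flag configs that are too small for a given purpose before loading them. The sketch below uses a handful of 'train' counts copied from the entries above. The helper `configs_below` and the threshold of 100 are illustrative choices, not part of TFDS.

```python
from typing import Dict, List

# 'train' example counts copied from the listing above.
TRAIN_EXAMPLES: Dict[str, int] = {
    "unshuffled_deduplicated_wuu": 64,
    "unshuffled_deduplicated_yo": 49,
    "unshuffled_original_als": 7324,
    "unshuffled_original_arz": 158113,
    "unshuffled_original_az": 912330,
    "unshuffled_original_bcl": 1,
}


def configs_below(counts: Dict[str, int], min_examples: int) -> List[str]:
    """Return the names of configs with fewer than `min_examples` examples, sorted."""
    return sorted(name for name, n in counts.items() if n < min_examples)


if __name__ == "__main__":
    for name in configs_below(TRAIN_EXAMPLES, 100):
        print(name, TRAIN_EXAMPLES[name])
```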

unshuffled_original_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
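
The feature specification above describes each example as an integer "id" plus a string "text". As a rough illustration only, a record can be checked against that specification in plain Python. The helper name matches_features and the dtype-to-type mapping PY_TYPES are assumptions made for this sketch; they are not part of TFDS or of the OSCAR tooling.

```python
import json

# Feature spec copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from dtype names to Python types (assumption for this sketch).
PY_TYPES = {"int64": int, "string": str}


def matches_features(record, features):
    """Return True if record has exactly the declared keys with matching Python types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))
print(matches_features({"id": "0", "text": "example"}, features))
```

The second call passes the id as a string, so it does not satisfy the declared int64 dtype.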

unshuffled_original_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
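
The configuration names listed on this page appear to follow the pattern unshuffled_<variant>_<language>, where the variant is either "original" or "deduplicated". A minimal sketch, assuming that pattern holds, of splitting a configuration name and building the string passed to tfds.load. The names OscarConfig, parse_config and tfds_name are hypothetical helpers introduced here, not part of TFDS.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    variant: str   # "original" or "deduplicated"
    language: str  # language code as used on this page, e.g. "et" or "ckb"


def parse_config(name: str) -> OscarConfig:
    """Split a name such as 'unshuffled_original_et' into variant and language."""
    # Unpacking raises ValueError if the name has fewer than three parts.
    prefix, variant, language = name.split("_", 2)
    if prefix != "unshuffled" or variant not in ("original", "deduplicated"):
        raise ValueError(f"unexpected configuration name: {name!r}")
    return OscarConfig(variant, language)


def tfds_name(config: str) -> str:
    """Build the dataset name string used in the tfds.load calls on this page."""
    parse_config(config)  # validate the assumed pattern first
    return f"huggingface:oscar/{config}"


print(parse_config("unshuffled_deduplicated_zh"))
print(tfds_name("unshuffled_original_et"))
```

The resulting string can be passed to tfds.load exactly as in the commands shown for each configuration.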

unshuffled_deduplicated_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
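
The configuration names listed in this catalog appear to follow the pattern `unshuffled_<variant>_<language code>`, where the variant is `deduplicated` or `original`. As a minimal sketch under that assumption, the TFDS name passed to `tfds.load` can be assembled from its parts. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical and not part of TFDS; only `tfds.load` itself comes from the library, and it is imported lazily so the name helper works without TensorFlow Datasets installed.

```python
def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_deduplicated_ilo'.

    Assumes the naming pattern seen in this catalog; not every
    language/variant combination is guaranteed to exist.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = True, split: str = "train"):
    """Hypothetical convenience wrapper around tfds.load."""
    # Imported here so that building names does not require TFDS.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


print(oscar_tfds_name("ilo"))
# huggingface:oscar/unshuffled_deduplicated_ilo
print(oscar_tfds_name("fa", deduplicated=False))
# huggingface:oscar/unshuffled_original_fa
```

Calling `load_oscar("ilo")` would then be equivalent to the load command shown for that configuration, provided TensorFlow Datasets and its Hugging Face integration are installed.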

unshuffled_original_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13704702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
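
Every configuration in this catalog declares the same two features: an `int64` field `id` and a `string` field `text`. The sketch below shows one way to check a record against that declaration. It assumes the record has already been converted to plain Python values (TFDS itself typically yields tensors or byte strings), and the names `FEATURES` and `matches_features` are hypothetical helpers introduced only for illustration.

```python
# Feature declaration copied from the catalog entry (JSON null -> None).
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from the declared dtypes to plain Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(example: dict, features: dict = FEATURES) -> bool:
    """Return True if `example` has exactly the declared fields and types."""
    if set(example) != set(features):
        return False
    for name, spec in features.items():
        value = example[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["dtype"]]):
            return False
        # int64 values must fit in a signed 64-bit range.
        if spec["dtype"] == "int64" and not -(2**63) <= value < 2**63:
            return False
    return True


print(matches_features({"id": 0, "text": "sample document"}))  # True
print(matches_features({"id": "0", "text": "sample document"}))  # False
```

A check like this can be useful as a sanity test after decoding a few records, before running a longer preprocessing job.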

unshuffled_original_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33053
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 106
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3264660
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11197780
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39496439
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 338073
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1377
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 86561
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1737411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 197878
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16383
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 917
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 219334
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3229940
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 87235
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
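
Every load command in this listing uses the same 'huggingface:oscar/' prefix followed by the config name. As a minimal sketch, the hypothetical helpers below (tfds_name and load_config are illustrative names, not part of TFDS) build that name and pass it to tfds.load. The split argument is an assumption added here so that a single split is returned instead of all splits.

```python
HF_PREFIX = "huggingface:oscar/"  # prefix used by every load command in this listing


def tfds_name(config):
    """Return the TFDS dataset name for an OSCAR config such as 'unshuffled_deduplicated_pa'."""
    return f"{HF_PREFIX}{config}"


def load_config(config, split="train"):
    """Load one OSCAR config through TFDS (hypothetical convenience wrapper)."""
    # Imported lazily so tfds_name() works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)
```

Calling load_config('unshuffled_deduplicated_pa') would request only the 'train' split of the config shown above; it needs tensorflow_datasets installed and network access.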

unshuffled_deduplicated_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
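
The feature description above declares two fields per example: an int64 "id" and a string "text". The sketch below shows one way to check a plain Python example against that description. check_example is a hypothetical helper, and accepting bytes for the string field is an assumption made because decoded TFDS examples often carry text as bytes.

```python
import json

# Feature description copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def check_example(example, features):
    """Return a list of problems found in one example (empty if it conforms)."""
    problems = []
    for name, spec in features.items():
        if name not in example:
            problems.append(f"missing field: {name}")
            continue
        value = example[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = (
                isinstance(value, int)
                and not isinstance(value, bool)
                and INT64_MIN <= value <= INT64_MAX
            )
        elif spec["dtype"] == "string":
            ok = isinstance(value, (str, bytes))
        else:
            ok = False  # dtype not covered by this sketch
        if not ok:
            problems.append(f"bad value for {name}: {value!r}")
    return problems


features = json.loads(FEATURES_JSON)
```

An example such as {"id": 0, "text": "some text"} yields an empty list, while a missing or mistyped field is reported by name.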

unshuffled_deduplicated_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8555
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 120684
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 461598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 24803
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3749826
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 82738
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 428674
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3317
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
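
The config names in this listing appear to follow the pattern unshuffled_<variant>_<language code>, where the variant is either "deduplicated" or "original". Assuming that pattern holds for every config, the hypothetical helper below (parse_config is an illustrative name) splits a config name into its variant and language code, which can be handy for grouping configs.

```python
VARIANTS = ("deduplicated", "original")  # the two variants seen in this listing


def parse_config(name):
    """Split a config name like 'unshuffled_original_am' into (variant, language code)."""
    # maxsplit=2 keeps any underscore inside the language code intact.
    prefix, variant, lang = name.split("_", 2)
    if prefix != "unshuffled" or variant not in VARIANTS or not lang:
        raise ValueError(f"unexpected config name: {name!r}")
    return variant, lang
```

Names that do not match the assumed pattern raise ValueError rather than being guessed at.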

unshuffled_original_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83663
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14985
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 586031
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26795
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56248
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
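
The feature spec above lists two fields per record: an int64 `id` and a string `text`. As a minimal sketch, assuming records have already been converted to plain Python `int` and `str` values (TFDS itself yields tensors), a record could be checked against that spec as follows. The names `FEATURES` and `matches_spec` are hypothetical and not part of TFDS or OSCAR.

```python
# Hypothetical helper: checks a plain-Python record against the spec above.
FEATURES = {"id": "int64", "text": "string"}

# Range of a signed 64-bit integer.
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_spec(record: dict) -> bool:
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = (
        isinstance(rec_id, int)
        and not isinstance(rec_id, bool)
        and INT64_MIN <= rec_id <= INT64_MAX
    )
    return id_ok and isinstance(text, str)
```

The same check applies to every configuration on this page, since they all share this feature spec.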

unshuffled_original_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 157698
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
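
Every load command on this page follows the same pattern: the prefix `huggingface:oscar/` followed by the configuration name. A minimal sketch of wrapping that pattern is shown below; `oscar_tfds_name` and `load_train` are hypothetical helper names, and `load_train` assumes `tensorflow_datasets` is installed and that the data can be downloaded.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config, e.g. 'unshuffled_original_cy'."""
    if not config or any(ch.isspace() for ch in config):
        raise ValueError(f"invalid config name: {config!r}")
    return f"huggingface:oscar/{config}"


def load_train(config: str):
    """Load the 'train' split of one OSCAR config (needs tensorflow_datasets)."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split="train")
```

For example, `load_train("unshuffled_original_cy")` would issue the same `tfds.load` call as the command shown for that configuration, restricted to the `'train'` split.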

unshuffled_original_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 65
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 96742378
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5799
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 240691
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7959
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1040
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 832
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 159363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46535
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 94588
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1401
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1593820
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 220
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 326804
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 61
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4696
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
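
The feature specification above describes each record as two fields: an integer `id` and a string `text`. As a minimal sketch, a plain-Python check of an already decoded record against that specification could look like the following. The helper name `matches_features`, the sample record, and the dtype-to-Python mapping are assumptions made for illustration; they are not part of TFDS or of the OSCAR release.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names used above to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record, features):
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = DTYPE_TO_PYTHON[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


features = json.loads(FEATURES_JSON)
sample = {"id": 0, "text": "hypothetical example text"}  # made-up record
print(matches_features(sample, features))
```

Records read through TFDS may arrive as tensors or byte strings rather than native Python values, so a real check would likely decode them first.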

unshuffled_original_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
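
The configuration names in this listing follow a visible pattern: `unshuffled_original_<language code>` or `unshuffled_deduplicated_<language code>`, prefixed with `huggingface:oscar/`. A small sketch of building that name programmatically is shown below. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the loader assumes that `tensorflow_datasets` is installed and that network access is available.

```python
def oscar_tfds_name(language_code, deduplicated=False):
    """Build the TFDS name string for one OSCAR language configuration."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language_code}"


def load_oscar(language_code, deduplicated=False, split="train"):
    """Load one configuration; needs tensorflow_datasets and network access."""
    # Imported lazily so that the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language_code, deduplicated), split=split)


print(oscar_tfds_name("oc"))
```

Passing `split="train"` matches the only split listed for these configurations.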

unshuffled_original_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 98216
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9387265
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5492194
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1013619
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1263280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 27537
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1001
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3783
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46981781
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 563916
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7345075
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17957
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 603937
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
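
Every load command on this page uses a name of the form 'huggingface:oscar/unshuffled_original_<language>' or 'huggingface:oscar/unshuffled_deduplicated_<language>'. A minimal sketch of building that name from its parts is shown below; the helper tfds_name is hypothetical, and it only assembles the string without checking that the configuration exists.

```python
def tfds_name(language, deduplicated=False):
    """Build the TFDS name for an OSCAR configuration listed on this page.

    `language` is the language code used in the config name, e.g. "ml".
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


# Example: reproduce the name used in the load command above.
name = tfds_name("ml")
```

The resulting string can be passed to tfds.load in place of the literal name.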

unshuffled_original_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 534016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18174
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 185884
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5213
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3225
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 452
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14291
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36700
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 156
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17395625
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
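
With more than 17 million examples in the 'train' split, it can be convenient to request only a slice. The sketch below assumes that the usual TFDS split-slicing syntax ('train[:N]') applies to this community dataset as it does to other TFDS datasets. The helpers subset_split and load_subset are hypothetical, and preparing the dataset may still require downloading the full corpus first.

```python
def subset_split(n_examples, split="train"):
    """Build a TFDS split string selecting the first `n_examples` examples."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_subset(n_examples):
    """Load the first `n_examples` examples of the Swedish configuration."""
    # Imported lazily so subset_split works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(
        "huggingface:oscar/unshuffled_original_sv",
        split=subset_split(n_examples),
    )
```

For instance, load_subset(1000) would request the split string 'train[:1000]'.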

unshuffled_original_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 89002
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18535253
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 12973467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14898250
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 60137667
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
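
Every configuration in this listing shares the same feature specification: an `int64` field `id` and a `string` field `text`. The sketch below shows one way to check a record against that specification. It assumes records are plain Python dictionaries, which is an assumption made for illustration (values yielded by TFDS itself are typically tensors or bytes); the function names are hypothetical.

```python
import json

# Feature specification copied from the listing above.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def value_matches(value, dtype: str) -> bool:
    """Check one Python value against a dtype name from the specification."""
    if dtype == "int64":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and INT64_MIN <= value <= INT64_MAX)
    if dtype == "string":
        return isinstance(value, str)
    return False  # dtypes outside this listing are not handled in this sketch


def matches_features(example: dict, features: dict = FEATURES) -> bool:
    """Return True if the example has exactly the listed fields with matching types."""
    if set(example) != set(features):
        return False
    return all(value_matches(example[name], spec["dtype"])
               for name, spec in features.items())
```

A check like this can be useful when records are exported to another format and read back, since a missing field or a stringified `id` would otherwise go unnoticed.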

unshuffled_deduplicated_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 304230423
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 256513
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 389515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1163
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 251064
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 924
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21735
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32652
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 25
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299457
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 669
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 136639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 55
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20812149
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44230
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20682611
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26920397
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 115954598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33925
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 886223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 511
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 312644
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 294132
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15503
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9161
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32919
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 201117
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
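The `unshuffled_deduplicated_af` entry earlier in this catalog lists 130640 training examples, against 201117 for `unshuffled_original_af` here. As a rough sketch, assuming the deduplicated configuration is derived from the original one (the page does not state this explicitly), the share of examples that survive deduplication can be computed from the two counts. The variable names are illustrative only.

```python
# Example counts taken from the split tables in this catalog.
original_af = 201117       # unshuffled_original_af, 'train'
deduplicated_af = 130640   # unshuffled_deduplicated_af, 'train'

# Examples missing from the deduplicated configuration,
# under the assumption that it is a subset of the original.
removed = original_af - deduplicated_af

# Fraction of the original examples that remain.
retained_ratio = deduplicated_af / original_af

print(f"removed: {removed}")
print(f"retained: {retained_ratio:.1%}")
```

The same arithmetic applies to any language code that appears under both the `unshuffled_original_*` and `unshuffled_deduplicated_*` prefixes.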

unshuffled_original_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16365602
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 336
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 37085
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21001388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 104913504
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10425596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88199221
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8557453
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 640
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 582219
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 659430
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2638
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62721527
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 524591
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1581
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 146993
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 137
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2977757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3212
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 395605
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1055
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299938
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5546211
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 127467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4599
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22301
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 672077
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41986
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6064129
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 135923
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 638596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3366
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 455994980
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
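
With roughly 456 million training examples, this config is far larger than the others listed here, so it can be useful to load only a fraction of the split. The sketch below builds a TFDS split-slicing string such as `train[:1%]`; the helper names `train_slice` and `load_fraction` are hypothetical, and it is an assumption that percent slicing behaves for this Hugging Face-backed dataset as it does for native TFDS datasets.

```python
def train_slice(percent: int) -> str:
    """Return a TFDS split string selecting the first `percent` of 'train'."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"train[:{percent}%]"


def load_fraction(config: str, percent: int = 1):
    """Load only the leading fraction of the training split (assumed to be supported)."""
    # Imported lazily; requires tensorflow_datasets and network access.
    import tensorflow_datasets as tfds

    return tfds.load(f"huggingface:oscar/{config}", split=train_slice(percent))


if __name__ == "__main__":
    print(train_slice(1))
```

Note that slicing selects which examples are read; it may not reduce how much data has to be downloaded and prepared first.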

unshuffled_original_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 506883
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
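
Every load command in this listing follows the same pattern: the prefix `huggingface:oscar/` followed by the configuration name. A minimal sketch of a helper built on that pattern is shown below. The function names `tfds_name` and `load_oscar` are hypothetical, and the sketch assumes `tensorflow_datasets` is installed. `'train'` is the only split listed for these configurations.

```python
def tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR configuration such as 'unshuffled_original_eu'."""
    if not config:
        raise ValueError("config must be a non-empty string")
    return f"huggingface:oscar/{config}"


def load_oscar(config: str, split: str = "train"):
    """Load one split of an OSCAR configuration through TFDS.

    The import is deferred so that tfds_name() can be used without
    TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split=split)
```

For example, `load_oscar('unshuffled_original_eu')` would issue the same `tfds.load` call as the command shown above, restricted to the `'train'` split.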

unshuffled_original_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
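
The feature specification above declares two fields per example: an `int64` `id` and a `string` `text`. The following sketch checks a plain Python record against that specification. The helper `matches_features` and the `PYTHON_TYPES` mapping are assumptions made for illustration and are not part of TFDS or the OSCAR tooling.

```python
import json

# Feature specification as listed for each OSCAR configuration.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names above to Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if record has exactly the declared fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = PYTHON_TYPES[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


features = json.loads(FEATURES_JSON)
```

A record such as `{"id": 0, "text": "..."}` passes the check, while a record with a missing field or a string-valued `id` does not.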

unshuffled_original_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 544388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3808397
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16236463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 625673
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1445
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 350363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1549
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34807
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 52910
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 123
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 437871
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 232329
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34682142
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 35440972
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
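
The configuration names on this page follow the pattern `unshuffled_<variant>_<language code>`, and the load commands prefix them with `huggingface:oscar/`. A small sketch of building that name string is given here. The helper `oscar_tfds_name` is hypothetical, and the set of accepted variants is an assumption based only on the names that appear on this page.

```python
# Variants that appear in configuration names on this page (assumed complete).
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang: str, variant: str = "original") -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_pl'."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


print(oscar_tfds_name("pl"))
```

The resulting string is what would be passed to `tfds.load`, as in the load commands listed for each configuration.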

unshuffled_original_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42114520
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 161836003
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
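
With 161836003 examples in `train`, it may be convenient to experiment on a subset first. TFDS accepts split slicing strings such as `train[:1000]` in the `split` argument of `tfds.load`; whether slicing is efficient for this community dataset is an assumption that has not been verified here. The hypothetical helper below only builds the slice string.

```python
def head_split(split: str, n_examples: int) -> str:
    """Return a TFDS-style slice string selecting the first `n_examples` of `split`."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


subset = head_split("train", 1000)
print(subset)

# Possible usage (requires tensorflow_datasets; left commented out as untested):
# import tensorflow_datasets as tfds
# ds = tfds.load('huggingface:oscar/unshuffled_original_ru', split=subset)
```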

unshuffled_original_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1746604
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 805
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 475703
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 458206
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22255
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9760
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59364
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
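
The `train` split sizes listed for the `unshuffled_original_*` configurations from `pl` through `yi` differ by several orders of magnitude. As an illustration, the counts can be collected into a dictionary and ranked. The numbers below are copied from the split tables above; the variable names are hypothetical.

```python
# 'train' example counts copied from the split tables above,
# keyed by the language code at the end of each configuration name.
TRAIN_EXAMPLES = {
    "pl": 35440972,
    "pt": 42114520,
    "ru": 161836003,
    "sd": 44280,
    "sl": 1746604,
    "su": 805,
    "te": 475703,
    "tl": 458206,
    "ug": 22255,
    "vec": 73,
    "war": 9760,
    "yi": 59364,
}

# Sort language codes from the largest split to the smallest.
ranked = sorted(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get, reverse=True)
largest, smallest = ranked[0], ranked[-1]

print(largest, TRAIN_EXAMPLES[largest])
print(smallest, TRAIN_EXAMPLES[smallest])
```

A ranking like this can help decide which configurations are small enough to load in full and which are better sampled first.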