References:
unshuffled_deduplicated_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 130640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
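The load commands on this page all follow one naming pattern, so the TFDS name can be built from a language code. The following is a minimal sketch, assuming `tensorflow_datasets` is installed with Hugging Face dataset support; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical and not part of TFDS.

```python
def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS name for an unshuffled, deduplicated OSCAR config."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one config (hypothetical helper).

    TFDS is imported lazily so the name helper works even where
    tensorflow_datasets is not installed.
    """
    import tensorflow_datasets as tfds  # assumption: installed separately

    return tfds.load(oscar_tfds_name(lang_code), split=split)


if __name__ == "__main__":
    # Prints the same name used in the load command above.
    print(oscar_tfds_name("af"))
```

Calling `load_oscar("af")` would download and prepare the data on first use, so it is kept out of the `__main__` block here.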
unshuffled_deduplicated_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4518 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
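Every config on this page lists the same two features, an `int64` `id` and a `string` `text`. As a rough sketch, a record that has already been decoded to plain Python values could be checked against that spec as follows; the `PY_TYPES` mapping and the `matches_features` helper are illustrative assumptions, not part of TFDS or Hugging Face `datasets`.

```python
import json

# Feature spec copied from this page; all configs listed here share it.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Hypothetical mapping from the dtype strings above to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(example: dict) -> bool:
    """Return True if `example` has exactly the listed fields with matching types."""
    if set(example) != set(FEATURES):
        return False
    return all(
        isinstance(example[name], PY_TYPES[spec["dtype"]])
        for name, spec in FEATURES.items()
    )
```

Records read directly from a `tf.data.Dataset` hold tensors rather than Python values, so they would need to be converted (for example with `tfds.as_numpy` and a UTF-8 decode of `text`) before a check like this applies.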
unshuffled_deduplicated_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 79928 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2025 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27050 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 43102 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 307405 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15762 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26145 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 626796 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1114481 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2984 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10130 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
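Every load command in this listing follows the same naming pattern, with only the language code changing. As a small sketch, the hypothetical helper `oscar_tfds_name` below builds the dataset name from a language code. It assumes codes consist only of ASCII letters, which holds for the codes shown in this section (for example `ce`, `cv`, `diq`) but is not guaranteed for every configuration.

```python
# Common prefix shared by the load commands in this listing.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code: str) -> str:
    """Build the dataset name string passed to tfds.load for one language code.

    Hypothetical helper; it only formats a string and does not load any data.
    """
    code = lang_code.strip().lower()
    # Assumption: codes are plain ASCII letters, as in the entries listed here.
    if not (code.isascii() and code.isalpha()):
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return PREFIX + code


name = oscar_tfds_name("cv")
```

The resulting string can then be passed to `tfds.load` in place of the literal name used in the commands above.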
unshuffled_deduplicated_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 80 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1172041 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
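Each configuration in this listing ships a single 'train' split, so any validation set has to be carved out by the user. One possible approach, sketched below under the assumption that each record's `id` is an integer as declared in the feature specification, is to route records deterministically by `id`. The function name `assign_split` and the one-in-ten ratio are illustrative choices, not something the dataset prescribes.

```python
def assign_split(example_id: int, holdout_every: int = 10) -> str:
    """Deterministically route a record to 'train' or 'validation' by its id.

    Hypothetical helper: every `holdout_every`-th id goes to validation.
    """
    if holdout_every < 2:
        raise ValueError("holdout_every must be at least 2")
    return "validation" if example_id % holdout_every == 0 else "train"


# Illustration on a small range of made-up ids.
validation_ids = [i for i in range(20) if assign_split(i) == "validation"]
```

Because the assignment depends only on the `id` value, the same record lands in the same subset on every run, regardless of iteration order.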
unshuffled_deduplicated_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3398679 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1770 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2458067 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68210 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9006977 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 360 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14724 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4771098 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17024 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84752 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
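The feature specification above describes each record as an int64 `id` plus a string `text`. The following is a minimal sketch of checking a plain Python record against that specification using only the standard library. The names `FEATURES` and `validate_record` and the sample records are hypothetical and are not part of TFDS or OSCAR; records obtained through TFDS may arrive as tensors or bytes and would need converting first.

```python
import json

# Feature specification copied from the listing above.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def validate_record(record: dict) -> bool:
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            # Unknown dtype: this sketch only covers the two listed here.
            return False
    return True
```

A record such as `{"id": 0, "text": "Saluton"}` passes, while a record with a missing field or a non-integer `id` is rejected.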
unshuffled_deduplicated_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8203495 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20661 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 68 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 12308039 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1909387 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6582908 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59448891 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3883 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 169834 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3084 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 529 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 617 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 617 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("No Rights Reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 108346 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
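As a minimal sketch of the load command above, the dataset name can be assembled from the language code. The helper names `oscar_config_name` and `load_oscar` are hypothetical and not part of TFDS. The sketch assumes the `tensorflow_datasets` package is installed and that the `huggingface:` community datasets are available in that installation.

```python
def oscar_config_name(lang: str) -> str:
    """Build the TFDS name of an unshuffled, deduplicated OSCAR config."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one OSCAR config through TFDS (data is downloaded on first use)."""
    # Imported lazily so the name helper also works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang), split=split)
```

Calling `load_oscar("km")` would request the same dataset as the command listed for this config. Because it passes `split="train"`, it returns that single split rather than a dictionary of all splits.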
unshuffled_deduplicated_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 29054 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
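The feature listing above describes every record as an integer `id` plus a string `text`. The following is an illustrative sketch only: it checks a plain Python dictionary against that listing. The `DTYPE_TO_PYTHON` mapping and the `matches_features` helper are assumptions made for this example. Records obtained through TFDS arrive as tensors or NumPy values and would need converting first.

```python
import json

# Feature spec copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtype strings to plain Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(example: dict, features: dict) -> bool:
    """Return True if `example` has exactly the listed fields with matching types."""
    if set(example) != set(features):
        return False
    return all(
        isinstance(example[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "sample"}, features))    # True
print(matches_features({"id": "0", "text": "sample"}, features))  # False
```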
unshuffled_deduplicated_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18808 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1374 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 843195 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 166 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 212556 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 58 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2126 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6485 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 67921 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 28522082 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 372158 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5044757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3675420 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
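As a minimal sketch of how the load command above might be used in practice, the snippet below builds the TFDS name for a language code and reads a few records from the 'train' split. The helper names `oscar_tfds_name` and `peek` are hypothetical, and the sketch assumes that `tensorflow_datasets` is installed with Hugging Face dataset support.

```python
def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name of a deduplicated, unshuffled OSCAR config."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def peek(lang: str, n: int = 3) -> list:
    """Return the first n records of the 'train' split as NumPy-backed dicts.

    Assumption: tensorflow_datasets is installed. The import is deferred so
    that oscar_tfds_name() stays usable without it.
    """
    import tensorflow_datasets as tfds

    ds = tfds.load(oscar_tfds_name(lang), split="train")
    # Each record is expected to carry an "id" and a "text" entry.
    return list(tfds.as_numpy(ds.take(n)))
```

Under these assumptions, `peek("ko")` would download the Korean config on first use, so smaller configs such as `kw` may be a gentler starting point.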
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 68 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
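The feature specification above describes each example as an int64 `id` plus a string `text`. A minimal sketch of checking an already decoded, plain-Python record against that shape follows. The function name `matches_oscar_features` is hypothetical, and the explicit int64 range check is an assumption about how one might validate the `id` field.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Check that a record has exactly an int64-range "id" and a str "text"."""
    if set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    return INT64_MIN <= rec_id <= INT64_MAX and isinstance(text, str)
```

Records read through TFDS usually arrive as NumPy integers and byte strings, so they would need converting (for example with `int(...)` and `.decode("utf-8")`) before a check like this applies.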
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1381 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 72 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13343 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 453904 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 183443 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8714 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 109118 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2559 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2859 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 411 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7121 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2820821 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17610 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 645747 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 833101 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4694 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 24 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15074 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
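The feature specification above describes each record as an `int64` `id` plus a `string` `text`. As a minimal sketch, assuming records have already been converted to plain Python dicts, a record can be checked against that specification as follows. The helper name `matches_oscar_features` and the dtype-to-check mapping are hypothetical and are not part of TFDS or Hugging Face datasets.

```python
import json

# Feature spec copied from the listing above.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

_INT64_MIN, _INT64_MAX = -(2**63), 2**63 - 1


def _is_int64(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and _INT64_MIN <= value <= _INT64_MAX
    )


# Assumed mapping from the dtype names in the spec to Python-level checks.
DTYPE_CHECKS = {
    "int64": _is_int64,
    "string": lambda value: isinstance(value, str),
}


def matches_oscar_features(record, features=FEATURES):
    """Return True if `record` has exactly the spec's keys with matching dtypes."""
    if set(record) != set(features):
        return False
    return all(
        DTYPE_CHECKS[spec["dtype"]](record[name])
        for name, spec in features.items()
    )
```

Values returned by a real loader may be tensors or bytes rather than Python `int` and `str`, so they would need converting before a check like this applies.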
unshuffled_deduplicated_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 677 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2418 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11014487 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 56259 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 62398034 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11596446 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6521169 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7782375 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9897709 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 64 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 49 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
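The config names in this listing follow a visible pattern: `unshuffled_`, then `deduplicated` or `original`, then a language code, with the TFDS name formed by prefixing `huggingface:oscar/`. The sketch below encodes that pattern. It is inferred only from the names shown here, not from an official grammar, and the helper names `parse_config` and `tfds_name` are hypothetical.

```python
import re

# Pattern inferred from config names in this listing, e.g.
# unshuffled_deduplicated_uz and unshuffled_original_als (an assumption).
_CONFIG_RE = re.compile(r"^unshuffled_(deduplicated|original)_([a-z]+)$")


def parse_config(config):
    """Split a config name into (variant, language_code)."""
    match = _CONFIG_RE.match(config)
    if match is None:
        raise ValueError(f"unrecognised OSCAR config name: {config!r}")
    return match.group(1), match.group(2)


def tfds_name(variant, language_code):
    """Build the string passed to tfds.load for a variant and language code."""
    return f"huggingface:oscar/unshuffled_{variant}_{language_code}"
```

For example, `tfds_name(*parse_config("unshuffled_original_als"))` yields the same string that appears in the load command for that entry.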
unshuffled_original_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7324 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 158113 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 912330 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1675515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2143 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4042 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20281 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Fitur :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
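The load command and feature schema listed for each configuration can be exercised roughly as in the following minimal sketch. The helper names (`tfds_name`, `matches_schema`, `load_config`) and the dtype-to-Python mapping are hypothetical, introduced here only for illustration; the one external call assumed is `tfds.load`, imported lazily so that the schema check runs without TensorFlow Datasets installed.

```python
import json

# Feature schema as listed for each OSCAR configuration on this page.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Illustrative assumption: how the listed dtypes map to plain Python types.
_PY_TYPES = {"int64": int, "string": str}


def tfds_name(config: str) -> str:
    """Build the TFDS name used in the load commands on this page."""
    return f"huggingface:oscar/{config}"


def matches_schema(record: dict) -> bool:
    """Return True if a plain-Python record has exactly the listed fields and types."""
    if set(record) != set(FEATURES):
        return False
    return all(
        isinstance(record[name], _PY_TYPES[spec["dtype"]])
        for name, spec in FEATURES.items()
    )


def load_config(config: str):
    """Load the 'train' split of one configuration.

    Requires tensorflow_datasets and network access; not called on import.
    """
    import tensorflow_datasets as tfds  # lazy import: optional dependency
    return tfds.load(tfds_name(config), split="train")
```

Starting with a small configuration such as `unshuffled_original_diq` (one example in 'train') is a cheap way to confirm the setup before loading one of the multi-million-example configurations.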
unshuffled_original_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 84 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2093621 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41708901 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2449 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6999 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42551 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5869686 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6046 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4390754 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 103639 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 56326016 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7664010 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21018 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 121168 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5326443 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46493 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 484 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 321484 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 396093 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1578 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13704702 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
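The entries in this listing follow a common naming pattern: `huggingface:oscar/unshuffled_`, then `original` or `deduplicated`, then a language code. A small sketch of building that name programmatically, assuming the pattern holds for the configurations you need (the helper `oscar_tfds_name` is hypothetical, not a TFDS function):

```python
def oscar_tfds_name(language, deduplicated):
    """Build the TFDS dataset name for an OSCAR configuration.

    language: language code as used in the listing, e.g. "fa" or "ilo".
    deduplicated: True for the deduplicated variant, False for the original.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"
```

The returned string can then be passed to `tfds.load` in place of the literal name shown in each entry.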
unshuffled_original_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 33053 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 106 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3264660 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11197780 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 101 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 39496439 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 338073 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1377 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 86561 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 118 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1737411 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 197878 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16383 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 917 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 219334 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3229940 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 87235 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
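Every configuration in this listing declares the same two features, an `int64` `id` and a `string` `text`. As a minimal sketch of what that schema implies for a single record, the check below parses the feature JSON and compares a plain Python dictionary against it. The `PYTHON_TYPES` mapping and the `matches_schema` helper are hypothetical names introduced for illustration; they are not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# Feature schema as shown in this listing (identical for every config).
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumption: a simple mapping from the dtype names to Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_schema(record, features):
    """Return True if record has exactly the schema's keys with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_schema({"id": 0, "text": "example"}, features))
print(matches_schema({"id": "0", "text": "example"}, features))
```

The first call passes because both fields have the declared types; the second fails because `id` is a string rather than an integer.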
unshuffled_deduplicated_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8555 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 120684 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 461598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 24803 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3749826 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 82738 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 428674 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3317 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
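The configuration names in this listing follow a visible pattern: `unshuffled_`, then either `deduplicated` or `original`, then a language code. The sketch below builds the string passed to `tfds.load` from those two parts. The helper name `oscar_config_name` is hypothetical, and the check only covers the two variants that appear in this listing; it does not verify that a given language code actually exists.

```python
# Variants that appear in this listing; others may exist elsewhere.
KNOWN_VARIANTS = ("deduplicated", "original")


def oscar_config_name(variant, lang):
    """Build a name such as 'huggingface:oscar/unshuffled_original_am'."""
    if variant not in KNOWN_VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_config_name("deduplicated", "yue")
print(name)
# With TensorFlow Datasets installed, the result would be used as:
#   ds = tfds.load(name)
```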
unshuffled_original_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83663 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14985 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15446 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 586031 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26795 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 56248 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 157698 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 65 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 96742378 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5799 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
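Every config on this page shares the same two features, an `int64` `id` and a `string` `text`. The sketch below shows one way a plain Python record could be checked against that feature description. The `PY_TYPES` mapping and the function name `matches_features` are assumptions made for illustration; they are not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# The feature description as listed for each config on this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype strings above to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Check that a record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example text"}, features))    # True
print(matches_features({"id": "0", "text": "example text"}, features))  # False
```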
unshuffled_original_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 240691 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7959 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1040 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 694 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 832 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 159363 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46535 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 94588 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1401 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1593820 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 220 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 326804 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 61 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4696 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 10709 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 98216 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9387265 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
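The configuration names on this page follow a visible pattern: `unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>`, prefixed with `huggingface:oscar/` when passed to `tfds.load`. A minimal sketch of building those names programmatically is shown below. The helper names (`oscar_config_name`, `oscar_tfds_name`, `load_oscar`) are hypothetical and not part of TFDS, and the `split="train"` argument assumes the single 'train' split listed in the tables here. Loading also assumes that `tensorflow_datasets` and its Hugging Face dependencies are installed.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build a config name such as 'unshuffled_original_scn' (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the full name string used in the tfds.load commands on this page."""
    return f"huggingface:oscar/{oscar_config_name(lang, deduplicated)}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical wrapper).

    The import is done lazily so the name helpers above can be used
    without TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)


if __name__ == "__main__":
    # Print the name strings for a few of the languages listed on this page.
    for code in ("scn", "sk", "sr"):
        print(oscar_tfds_name(code))
```

Small configurations such as `scn` (21 examples) are a convenient first check before loading a large one such as `sk` (5492194 examples).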
unshuffled_original_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5492194 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
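Every configuration on this page shares the same two-field feature description: an `int64` field `id` and a `string` field `text`. As an illustration only, the sketch below parses that JSON and checks a plain Python record against it. The `conforms` function and the `PYTHON_TYPES` mapping are hypothetical, and the mapping from `dtype` names to Python types is an assumption made for this example rather than anything defined by TFDS.

```python
import json

# Feature description copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names above to Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def conforms(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = PYTHON_TYPES[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


FEATURES = json.loads(FEATURES_JSON)

if __name__ == "__main__":
    print(conforms({"id": 0, "text": "example document"}, FEATURES))
    print(conforms({"id": "0", "text": "example document"}, FEATURES))
```

Records read through TFDS usually arrive as tensors or NumPy values rather than plain Python objects, so a check like this would need converting them first.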
unshuffled_original_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1013619 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1263280 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6456 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 27537 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1001 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3783 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, pl