OSCAR

References:

unshuffled_deduplicated_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  130640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
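
Every configuration on this page follows the same naming pattern, so the load command can be generated from the configuration name. The sketch below is illustrative only: `oscar_tfds_name` and `load_oscar` are hypothetical helper names, not part of TFDS, and `load_oscar` assumes that `tensorflow_datasets` and the Hugging Face `datasets` package are installed.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config such as 'unshuffled_deduplicated_af'."""
    return f"huggingface:oscar/{config}"


def load_oscar(config: str, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical convenience wrapper).

    The import is done lazily so that oscar_tfds_name() can be used
    even when tensorflow_datasets is not installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split=split)


if __name__ == "__main__":
    # Prints the same name string used in the load command above.
    print(oscar_tfds_name("unshuffled_deduplicated_af"))
```

Calling `load_oscar("unshuffled_deduplicated_af")` is then intended to be equivalent to the command shown for this configuration, restricted to the 'train' split.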

unshuffled_deduplicated_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4518
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
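
The features listing describes each record as an int64 `id` plus a string `text`. As a rough illustration of what that schema implies, the following sketch checks a plain Python dict against it. The function name `matches_schema` and the sample records are made up for this example; the check is a simplification and not how TFDS or the Hugging Face library validates data.

```python
import json

# The features listing from this page, reproduced as JSON.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_schema(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            return False  # dtype not covered by this sketch
    return True


features = json.loads(FEATURES_JSON)
print(matches_schema({"id": 0, "text": "sample text"}, features))
```

A record with a missing field, an extra field, or a non-integer `id` is rejected by this check.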

unshuffled_deduplicated_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  79928
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2025
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  27050
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  43102
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  307405
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  15762
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  26145
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  626796
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
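
The 'train' split sizes listed so far range from a single example to several hundred thousand, which matters when choosing a configuration for training or evaluation. The sketch below collects the counts from the configurations above and picks out the very small ones. The threshold of 100 and the helper name `configs_below` are arbitrary choices for this illustration.

```python
# Example counts copied from the 'train' rows of the configurations listed above,
# keyed by the language suffix of each configuration name.
TRAIN_EXAMPLES = {
    "af": 130640, "als": 4518, "arz": 79928, "an": 2025, "ast": 5343,
    "ba": 27050, "am": 43102, "as": 9212, "azb": 9985, "be": 307405,
    "bo": 15762, "bxr": 36, "ceb": 26145, "az": 626796, "bcl": 1,
}


def configs_below(counts: dict, threshold: int) -> list:
    """Return the language codes with fewer than `threshold` examples, smallest first."""
    small = [lang for lang, n in counts.items() if n < threshold]
    return sorted(small, key=counts.get)


tiny = configs_below(TRAIN_EXAMPLES, 100)
largest = max(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)
print("configs under 100 examples:", tiny)
print("largest config so far:", largest)
```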

unshuffled_deduplicated_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  98225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  37
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1114481
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  702
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2984
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  10130
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  80
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
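
Every configuration on this page is loaded with a name of the form `huggingface:oscar/unshuffled_deduplicated_<lang>`. The sketch below builds that name from a language code and wraps the `tfds.load` call. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the rule that a code consists only of ASCII letters is an assumption based on the codes listed here, not something the catalog states.

```python
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name for one OSCAR configuration, e.g. 'eml' or 'et'.

    Assumption: language codes are non-empty and contain only ASCII letters.
    """
    code = lang.strip().lower()
    if not code or not (code.isascii() and code.isalpha()):
        raise ValueError(f"unexpected language code: {lang!r}")
    return PREFIX + code


def load_oscar(lang: str):
    """Load the 'train' split of one configuration (the only split listed here).

    The import is local so that building names does not require TFDS.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang), split="train")


if __name__ == "__main__":
    print(oscar_tfds_name("eml"))
```

Keeping the name construction separate from the load call makes it easy to check a list of language codes before starting any download.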

unshuffled_deduplicated_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1172041
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
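
Some configurations on this page are large (this one lists 1,172,041 training examples), so it can be convenient to read only the first few examples while experimenting. TFDS accepts slicing specifications such as `train[:1000]` in the `split` argument of `tfds.load`. The sketch below only builds such a string; the helper name `head_split` is hypothetical, and whether slicing avoids downloading the full configuration for these Hugging Face-backed datasets is not stated on this page.

```python
def head_split(n_examples: int, total: int, split: str = "train") -> str:
    """Build a TFDS slicing spec such as 'train[:1000]', capped at `total`.

    `total` is the example count listed for the configuration on this page.
    Hypothetical helper for illustration; not a TFDS API.
    """
    if n_examples < 0 or total < 0:
        raise ValueError("counts must be non-negative")
    return f"{split}[:{min(n_examples, total)}]"


if __name__ == "__main__":
    # 1172041 is the 'train' size listed above for this configuration.
    print(head_split(1000, 1172041))
```

The resulting string would be passed as `split=` in the `tfds.load` call shown for the configuration.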

unshuffled_deduplicated_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3398679
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1770
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2458067
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68210
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9006977
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 360
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 82
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14724
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4771098
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17024
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 84752
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8203495
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 20661
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 12308039
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1909387
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6582908
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59448891
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_dedupliced_gd

השתמש בפקודה הבאה כדי לטעון מערך נתונים זה ב-TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3883
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 169834
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
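
Many configurations in this listing, such as the Gujarati one above, contain non-Latin text. When a TFDS dataset is iterated as NumPy values, string features typically arrive as UTF-8 encoded `bytes` rather than `str`. The sketch below shows one way to turn such an example into plain Python values; `decode_example` is a hypothetical helper, and the sample record is made up for illustration rather than taken from the corpus.

```python
def decode_example(example):
    """Convert one NumPy-style example to plain Python types.

    Assumes `text` arrives as UTF-8 bytes, as string features typically do
    when a dataset is iterated through tfds.as_numpy.
    """
    text = example["text"]
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return {"id": int(example["id"]), "text": text}


# Stand-in for one element of tfds.as_numpy(ds); the content is made up.
raw = {"id": 0, "text": "ગુજરાતી".encode("utf-8")}
decoded = decode_example(raw)
print(decoded["id"], decoded["text"])
```

With a real dataset, the same function could be applied to each element yielded by `tfds.as_numpy(ds)`.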

unshuffled_deduplicated_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3084
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 529
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 617
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 617
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 108346
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 29054
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18808
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1374
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 843195
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 166
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 212556
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 58
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2126
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 67921
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 28522082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 372158
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5044757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
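Every configuration on this page is loaded with the same name pattern, differing only in the language suffix. The sketch below builds that name from a language code. The helpers `oscar_config_name` and `load_oscar` are hypothetical conveniences, and the validation rule (lowercase ASCII letters, optionally with underscores) is an assumption based on the codes shown here. `tensorflow_datasets` is imported lazily so the name builder can be used without it installed.

```python
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_config_name(lang):
    """Build the TFDS name for one language code, e.g. 'ro' or 'scn'."""
    # Assumed rule: codes are lowercase ASCII letters, possibly with underscores.
    stripped = lang.replace("_", "")
    if not (stripped.isascii() and stripped.isalpha() and lang.islower()):
        raise ValueError(f"unexpected language code: {lang!r}")
    return PREFIX + lang


def load_oscar(lang, split="train"):
    """Load one configuration; requires tensorflow_datasets and network access."""
    import tensorflow_datasets as tfds  # lazy import, only needed when loading

    return tfds.load(oscar_config_name(lang), split=split)
```

For example, `oscar_config_name("ro")` yields the same string used in the load command above.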

unshuffled_deduplicated_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3675420
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1381
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 72
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13343
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 453904
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 183443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
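The 'train' sizes listed on this page span several orders of magnitude, from 5 examples to over 5 million. When trying out a pipeline, it may help to pick the smallest configurations first. The sketch below does that with counts copied from the 'train' rows above. The table `TRAIN_EXAMPLES` and the helper `smallest_configs` are illustrative names, not part of any library.

```python
# Example counts copied from the 'train' rows listed in this document.
TRAIN_EXAMPLES = {
    "ro": 5044757,
    "scn": 17,
    "ko": 3675420,
    "kw": 68,
    "lez": 1381,
    "lrc": 72,
    "mg": 13343,
    "ml": 453904,
    "ms": 183443,
    "myv": 5,
}


def smallest_configs(counts, limit):
    """Return up to `limit` language codes, ordered by ascending example count."""
    return sorted(counts, key=counts.get)[:limit]


# Candidates for a quick smoke test before loading a multi-million-row config.
print(smallest_configs(TRAIN_EXAMPLES, 3))
```

Configurations this small are mainly useful for checking that loading and decoding work, not for training.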

unshuffled_deduplicated_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8714
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 109118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2559
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2859
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7121
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2820821
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17610
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 645747
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 833101
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4694
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 24
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 15074
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 677
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2418
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 11014487
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 56259
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 62398034
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 11596446
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6521169
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 7782375
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9897709
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 64
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 49
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
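
The configuration names on this page follow a visible pattern: `unshuffled_deduplicated_<code>` or `unshuffled_original_<code>`, prefixed with `huggingface:oscar/` in the load command. A small sketch of building the name programmatically follows. The function `oscar_config_name` is a hypothetical helper, not a TFDS API, and it does not verify that a given language code actually exists in the corpus.

```python
TFDS_PREFIX = "huggingface:oscar"

# The two variants that appear in the configuration names on this page.
VARIANTS = ("deduplicated", "original")


def oscar_config_name(variant, lang):
    """Build the TFDS name for an OSCAR config, e.g. for use with tfds.load.

    Hypothetical helper: it only assembles the string and cannot tell
    whether the language code is one the corpus provides.
    """
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"{TFDS_PREFIX}/unshuffled_{variant}_{lang}"


print(oscar_config_name("deduplicated", "yo"))
print(oscar_config_name("original", "als"))
```

The resulting string can be passed to `tfds.load` in place of the literal names shown in each section.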

unshuffled_original_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 7324
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 158113
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 912330
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
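
The config names on this page follow the pattern `unshuffled_original_<language>` or `unshuffled_deduplicated_<language>`, prefixed with `huggingface:oscar/` in the `tfds.load` call. A minimal sketch of building that name and loading the `train` split; the helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical, and the loader assumes `tensorflow_datasets` (plus whatever Hugging Face support it requires) is installed and that network access is available.

```python
def oscar_tfds_name(language, deduplicated=False):
    """Build the TFDS dataset name for an OSCAR config from a language code."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar_train(language, deduplicated=False):
    """Load the 'train' split; needs tensorflow_datasets and network access."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split="train")


print(oscar_tfds_name("bn"))
print(oscar_tfds_name("zh", deduplicated=True))
```

The two printed names correspond to the `unshuffled_original_bn` and `unshuffled_deduplicated_zh` entries listed on this page.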

unshuffled_original_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
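
Every load command in this catalog follows the same pattern: the `huggingface:oscar/` prefix, then a variant, then a language code. The sketch below builds that name and wraps `tfds.load` around it. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the variant list covers only the two prefixes that appear in this section. Loading a `huggingface:` dataset through TFDS probably also needs the Hugging Face `datasets` package installed, which is an assumption rather than something this page states.

```python
# Sketch: build the TFDS name used by the load commands in this catalog.
# oscar_tfds_name and load_oscar are illustrative helpers, not TFDS APIs.

# The two variant prefixes that appear in this section.
VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Return e.g. 'huggingface:oscar/unshuffled_deduplicated_ga'."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(variant: str, lang: str, split: str = "train"):
    """Load one config; tensorflow_datasets is imported lazily on first use."""
    import tensorflow_datasets as tfds  # requires the tensorflow-datasets package

    return tfds.load(oscar_tfds_name(variant, lang), split=split)
```

Calling `load_oscar("unshuffled_deduplicated", "ga")` would then be equivalent to the command shown for this config, with the `train` split selected explicitly.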

unshuffled_deduplicated_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13704702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33053
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 106
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3264660
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11197780
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39496439
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
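
The `train` split sizes of the deduplicated configs listed so far in this section differ by several orders of magnitude, which matters when deciding whether a config can be loaded in one go. The sketch below ranks them by size. The counts are copied from the split tables above, while `TRAIN_EXAMPLES` and `rank_by_size` are hypothetical names used only for illustration.

```python
# Sketch: compare the 'train' split sizes of the deduplicated configs
# listed so far in this section. Counts are copied from the split tables above.

TRAIN_EXAMPLES = {
    "fi": 5326443,
    "ga": 46493,
    "gom": 484,
    "hr": 321484,
    "hy": 396093,
    "ilo": 1578,
    "ja": 39496439,
}


def rank_by_size(counts: dict) -> list:
    """Return (language, count) pairs sorted from largest to smallest."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_size(TRAIN_EXAMPLES)
largest_lang, largest_count = ranked[0]
smallest_lang, smallest_count = ranked[-1]

for lang, count in ranked:
    # Thousands separators make the size gap easier to read.
    print(f"unshuffled_deduplicated_{lang}: {count:,} examples")
print(f"{largest_lang} is about {largest_count // smallest_count:,}x the size of {smallest_lang}")
```

A ranking like this could inform which configs to iterate over lazily rather than materializing in memory.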

unshuffled_deduplicated_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 338073
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1377
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 86561
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1737411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 197878
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  16383
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
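
The load commands in this listing all follow one naming pattern: `huggingface:oscar/unshuffled_<variant>_<language>`, where the variant is `deduplicated` or `original`. A minimal sketch of that pattern follows. The helpers `oscar_tfds_name` and `load_oscar` are hypothetical conveniences, not part of TFDS. `load_oscar` assumes `tensorflow_datasets` is installed and that the requested config exists.

```python
def oscar_tfds_name(language, deduplicated=True):
    """Build the TFDS name for an OSCAR config, following the pattern in this listing."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language, deduplicated=True, split="train"):
    """Load one OSCAR config through TFDS (needs tensorflow_datasets and network access)."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


print(oscar_tfds_name("mt"))
print(oscar_tfds_name("am", deduplicated=False))
```

Only the name helper runs offline. Calling `load_oscar("mt")` would download the data on first use.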

unshuffled_deduplicated_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  917
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  219334
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3229940
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  87235
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  8555
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  120684
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  461598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  24803
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3749826
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  82738
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  428674
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3317
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  36
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
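
The train split sizes of the deduplicated configs range from 7 examples to several million. As a rough sketch, the snippet below flags configs that fall under a size cutoff. The counts are copied from a few of the entries above. The cutoff `MIN_EXAMPLES` and the helper `small_configs` are hypothetical and do not come from the catalog.

```python
# Train split sizes copied from a few of the deduplicated configs listed above.
TRAIN_EXAMPLES = {
    "mt": 16383,
    "mzn": 917,
    "no": 3229940,
    "rm": 34,
    "th": 3749826,
    "xal": 36,
    "yue": 7,
}

MIN_EXAMPLES = 1000  # hypothetical cutoff, not taken from the catalog


def small_configs(counts, min_examples):
    """Return language codes with fewer than `min_examples` examples, smallest first."""
    small = [lang for lang, n in counts.items() if n < min_examples]
    return sorted(small, key=counts.get)


print(small_configs(TRAIN_EXAMPLES, MIN_EXAMPLES))
```

With these values the flagged configs are `yue`, `rm`, `xal` and `mzn`, in that order.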

unshuffled_original_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  83663
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  14985
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  15446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 586031
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
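
The load commands on this page all follow the same naming pattern, so a small helper can build the dataset name from a language code. This is only a sketch: `tfds_name` and `load_train_split` are hypothetical helpers, and it assumes that `tensorflow_datasets` is installed and that the `huggingface:` community namespace is available in your environment.

```python
def tfds_name(language, deduplicated=False):
    """Build the TFDS name for an OSCAR config, following the pattern used on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_train_split(language, deduplicated=False):
    """Load the 'train' split, the only split listed for each config."""
    # Imported lazily so that building names does not require TFDS to be installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(language, deduplicated), split="train")
```

For example, `tfds_name("be")` yields the same string that is passed to `tfds.load` in the command for `unshuffled_original_be`.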

unshuffled_original_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26795
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56248
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 157698
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 65
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 96742378
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5799
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 240691
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7959
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1040
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 832
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 159363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46535
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 94588
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1401
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1593820
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 220
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 326804
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
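
Every config on this page lists the same two features, an `int64` `id` and a `string` `text`. As a minimal sketch, the check below tests whether a plain Python dict matches that schema. The `matches_schema` helper and the dtype-to-Python-type mapping are assumptions made for illustration; records returned by TFDS arrive as tensors and would need decoding first.

```python
import json

# Feature schema copied from the listing above.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype strings to Python types.
PYTHON_TYPES = {"int64": int, "string": str}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_schema(record: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        expected = PYTHON_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True


print(matches_schema({"id": 0, "text": "example"}))
```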

unshuffled_original_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 61
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4696
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 98216
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9387265
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5492194
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1013619
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1263280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 27537
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1001
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3783
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46981781
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 563916
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7345075
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 203
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1485
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 88
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
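
Every configuration in this listing is loaded with the same name pattern, `huggingface:oscar/<config>`. The sketch below builds that name for a configuration and, when TensorFlow Datasets is installed, loads the `train` split. The helper names `tfds_name` and `load_train` are hypothetical, and the import is deferred so that the name helper works without TFDS installed.

```python
def tfds_name(config):
    """Build the TFDS name used in this listing for an OSCAR configuration."""
    return f"huggingface:oscar/{config}"


def load_train(config):
    """Load the 'train' split of one configuration (requires tensorflow_datasets)."""
    import tensorflow_datasets as tfds  # deferred: only needed when loading

    return tfds.load(tfds_name(config), split="train")
```

For example, `load_train("unshuffled_original_lrc")` would request the 88-example split listed above.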

unshuffled_original_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17957
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 603937
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 534016
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
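
The `train` split sizes listed so far range from a handful of examples to several million. As a rough sketch, the counts can be collected in a dictionary to flag configurations that are probably too small to train on. The 1000-example threshold and the helper name `small_configs` are assumptions chosen for illustration.

```python
# Example counts for the 'train' split, copied from the listings above.
TRAIN_EXAMPLES = {
    "unshuffled_original_ko": 7345075,
    "unshuffled_original_kw": 203,
    "unshuffled_original_lez": 1485,
    "unshuffled_original_lrc": 88,
    "unshuffled_original_mg": 17957,
    "unshuffled_original_ml": 603937,
    "unshuffled_original_ms": 534016,
    "unshuffled_original_myv": 6,
}


def small_configs(counts, threshold=1000):
    """Return config names with fewer than `threshold` examples, smallest first."""
    small = [name for name, n in counts.items() if n < threshold]
    return sorted(small, key=counts.get)


print(small_configs(TRAIN_EXAMPLES))
```

With the counts above, the flagged configurations are `myv`, `lrc`, and `kw`, in that order.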

unshuffled_original_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 18174
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 185884
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 5213
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 452
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14291
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 36700
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 156
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17395625
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 89002
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 18535253
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 12973467
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") (http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14898250
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
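
The load commands in this catalog all follow one naming pattern: `huggingface:oscar/` followed by a config name of the form `unshuffled_<variant>_<language code>`. A minimal sketch, assuming a hypothetical helper `oscar_tfds_name` (not part of TFDS), builds that string. The actual `tfds.load` call is left commented out because it downloads data and requires `tensorflow_datasets` to be installed.

```python
# Variants that appear in the config names on this page.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build the TFDS dataset name for one OSCAR config (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_tfds_name("original", "vi")
print(name)

# Loading is left commented out because it triggers a download:
# import tensorflow_datasets as tfds
# ds = tfds.load(name, split="train")
```

The helper only formats the name. It does not verify that a given language code exists for a given variant.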

unshuffled_original_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 60137667
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 304230423
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 256513
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 389515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1163
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 251064
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 924
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21735
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32652
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 25
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299457
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 669
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 136639
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 55
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 20812149
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
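With a train split of this size, it can be convenient to build the TFDS name from its parts and request only a prefix of the split. The sketch below is an illustration only: `oscar_tfds_name` and `load_prefix` are hypothetical helpers, and the `train[:N]` slice relies on the general TFDS split-slicing syntax, which is assumed to apply to `huggingface:` datasets as well. The slice limits what is read and may still require the full configuration to be downloaded and prepared first.

```python
def oscar_tfds_name(subset, lang):
    """Build the TFDS name used on this page, e.g. subset 'unshuffled_deduplicated', lang 'nl'."""
    return f"huggingface:oscar/{subset}_{lang}"


def load_prefix(subset, lang, n_examples):
    """Load only the first `n_examples` of the train split (hypothetical helper)."""
    # Imported lazily so the name helper works without tensorflow_datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(subset, lang), split=f"train[:{n_examples}]")


print(oscar_tfds_name("unshuffled_deduplicated", "nl"))
# prints: huggingface:oscar/unshuffled_deduplicated_nl
```

Calling `load_prefix("unshuffled_deduplicated", "nl", 1000)` would then request the first 1000 examples, assuming the packages needed for community datasets are installed.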

unshuffled_deduplicated_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 44230
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 20682611
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 26920397
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 115954598
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 33925
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 886223
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 511
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 312644
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 294132
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 15503
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 64
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9161
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 32919
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 201117
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
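This page lists both an original and a deduplicated Afrikaans configuration, so the two train-split sizes can be compared directly. The short calculation below uses only the example counts stated on this page (201117 for `unshuffled_original_af` and 130640 for `unshuffled_deduplicated_af`). The variable names are illustrative, and the result says nothing about how the deduplication itself was performed.

```python
# Train-split sizes as listed on this page for the Afrikaans (af) configs.
ORIGINAL_AF = 201117       # unshuffled_original_af
DEDUPLICATED_AF = 130640   # unshuffled_deduplicated_af

# Number and share of examples absent from the deduplicated config.
removed = ORIGINAL_AF - DEDUPLICATED_AF
removed_fraction = removed / ORIGINAL_AF

print(removed)                    # 70477
print(f"{removed_fraction:.1%}")  # 35.0%
```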

unshuffled_original_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 16365602
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 456
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
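
The config names on this page appear to follow the pattern `unshuffled_<variant>_<language>`, where the variant is `original` or `deduplicated`. A small sketch of splitting a name into those two parts. The pattern is inferred only from the names listed here, the language tag is assumed to be lowercase ASCII letters, and `parse_config_name` is a hypothetical helper:

```python
import re

# Pattern inferred from the config names listed on this page (an assumption).
_CONFIG_RE = re.compile(r"^unshuffled_(original|deduplicated)_([a-z]+)$")


def parse_config_name(config):
    """Split an OSCAR config name into (variant, language_tag)."""
    match = _CONFIG_RE.match(config)
    if match is None:
        raise ValueError(f"unrecognized OSCAR config name: {config!r}")
    variant, language = match.groups()
    return variant, language


if __name__ == "__main__":
    print(parse_config_name("unshuffled_original_bar"))
```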

unshuffled_original_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 336
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 37085
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21001388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 104913504
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
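
With more than a hundred million examples in this config, it can help to look at a few records before iterating over the whole split. A minimal sketch, assuming `tensorflow_datasets` is installed. `oscar_tfds_name` and `peek` are hypothetical helpers, and `tfds.load` may still need to download and prepare the full config before the first record is available:

```python
def oscar_tfds_name(config):
    """Build the TFDS name used on this page for an OSCAR config."""
    return f"huggingface:oscar/{config}"


def peek(config, n=3):
    """Return the first n records of the 'train' split as NumPy values."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(oscar_tfds_name(config), split="train")
    # take(n) limits iteration to n records; as_numpy converts tensors.
    return list(tfds.as_numpy(ds.take(n)))


if __name__ == "__main__":
    print(oscar_tfds_name("unshuffled_original_de"))
```

Calling `peek("unshuffled_original_de")` should return a list of dicts with `id` and `text` keys, matching the feature listing above.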

unshuffled_original_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10425596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88199221
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8557453
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 640
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 582219
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 659430
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2638
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62721527
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 524591
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1581
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 146993
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 137
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2977757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3212
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 395605
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1055
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299938
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5546211
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 127467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4599
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22301
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 672077
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41986
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6064129
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 135923
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 638596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3366
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 455994980
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
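
The configuration names on this page follow a visible pattern: `unshuffled_original_<language>` or `unshuffled_deduplicated_<language>`, prefixed with `huggingface:oscar/` when passed to `tfds.load`. A small sketch of building that name string is shown below; the function `oscar_tfds_name` is a hypothetical helper introduced here for illustration, and it only assembles the string without checking that the configuration actually exists.

```python
def oscar_tfds_name(language, deduplicated=False):
    """Hypothetical helper: build the TFDS name for an OSCAR config on this page.

    `language` is the code that ends the config name (for example "en" or "frr").
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


# The resulting string is what the load commands on this page pass to tfds.load.
print(oscar_tfds_name("en"))
```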

unshuffled_original_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 506883
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 544388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3808397
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16236463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 625673
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1445
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 350363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1549
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34807
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 52910
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 123
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 437871
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 232329
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34682142
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  35440972

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
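
Every config in this listing is loaded with the same name pattern, `huggingface:oscar/<config>`. The sketch below builds that name from a config string and wraps the load call. The helper names `tfds_name` and `load_train` are hypothetical, and `load_train` assumes `tensorflow_datasets` is installed and that the `huggingface:` namespace is available in the installed version.

```python
def tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config, following the load commands in this listing."""
    return f"huggingface:oscar/{config}"


def load_train(config: str):
    """Load the 'train' split of one config (assumes tensorflow_datasets is installed)."""
    # Imported lazily so tfds_name() can be used without the dependency.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config), split="train")
```

Calling `tfds_name("unshuffled_original_pl")` yields the same string used in the load command above. Larger configs hold tens of millions of examples, so a small config may be a better first choice when trying `load_train`.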

unshuffled_original_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  42114520

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  161836003

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  44280

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1746604

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  805

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  475703

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  458206

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  22255

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  73

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9760

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  59364

  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
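
The `unshuffled_original_*` configs listed above differ in size by several orders of magnitude, which matters when picking one for a quick experiment. The sketch below records the 'train' example counts from the split tables in this part of the listing and returns the smallest configs. The `TRAIN_EXAMPLES` dictionary and the `smallest_configs` helper are illustrative names, not part of TFDS.

```python
# 'train' example counts copied from the split tables in this part of the listing.
TRAIN_EXAMPLES = {
    "unshuffled_original_pl": 35440972,
    "unshuffled_original_pt": 42114520,
    "unshuffled_original_ru": 161836003,
    "unshuffled_original_sd": 44280,
    "unshuffled_original_sl": 1746604,
    "unshuffled_original_su": 805,
    "unshuffled_original_te": 475703,
    "unshuffled_original_tl": 458206,
    "unshuffled_original_ug": 22255,
    "unshuffled_original_vec": 73,
    "unshuffled_original_war": 9760,
    "unshuffled_original_yi": 59364,
}


def smallest_configs(n: int) -> list[str]:
    """Return the n config names with the fewest 'train' examples, smallest first."""
    return sorted(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)[:n]
```

With these counts, `smallest_configs(3)` returns the `vec`, `su`, and `war` configs, in that order.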