อินดอนลู

อ้างอิง:

อีโมต

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/emot')
  • คำอธิบาย :
An emotion classification dataset collected from the social media
platform Twitter (Saputri et al., 2018). The dataset consists of
around 4000 Indonesian colloquial language tweets, covering five
different emotion labels: sadness, anger, love, fear, and happy.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 440
'train' 3521
'validation' 440
  • คุณสมบัติ :
{
    "tweet": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "num_classes": 5,
        "names": [
            "sadness",
            "anger",
            "love",
            "fear",
            "happy"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    }
}

SMSA

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/smsa')
  • คำอธิบาย :
This sentence-level sentiment analysis dataset (Purwarianti and Crisdayanti, 2019)
is a collection of comments and reviews in Indonesian obtained from multiple online
platforms. The text was crawled and then annotated by several Indonesian linguists
to construct this dataset. There are three possible sentiments on the SmSA
dataset: positive, negative, and neutral.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 500
'train' 11000
'validation' 1260
  • คุณสมบัติ :
{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "num_classes": 3,
        "names": [
            "positive",
            "neutral",
            "negative"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    }
}

คาซ่า

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/casa')
  • คำอธิบาย :
An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected
from multiple Indonesian online automobile platforms (Ilmania et al., 2018). The dataset covers
six aspects of car quality. We define the task to be a multi-label classification task, where
each label represents a sentiment for a single aspect with three possible values: positive,
negative, and neutral.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 180
'train' 810
'validation' 90
  • คุณสมบัติ :
{
    "sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "fuel": {
        "num_classes": 3,
        "names": [
            "negative",
            "neutral",
            "positive"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "machine": {
        "num_classes": 3,
        "names": [
            "negative",
            "neutral",
            "positive"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "others": {
        "num_classes": 3,
        "names": [
            "negative",
            "neutral",
            "positive"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "part": {
        "num_classes": 3,
        "names": [
            "negative",
            "neutral",
            "positive"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "price": {
        "num_classes": 3,
        "names": [
            "negative",
            "neutral",
            "positive"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "service": {
        "num_classes": 3,
        "names": [
            "negative",
            "neutral",
            "positive"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    }
}

โฮซา

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/hoasa')
  • คำอธิบาย :
An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel
aggregator platform, AiryRooms (Azhar et al., 2019). The dataset covers ten different aspects of
hotel quality. Each review is labeled with a single sentiment label for each aspect. There are
four possible sentiment classes for each sentiment label: positive, negative, neutral, and
positive-negative. The positivenegative label is given to a review that contains multiple sentiments
of the same aspect but for different objects (e.g., cleanliness of bed and toilet).
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 286
'train' 2283
'validation' 285
  • คุณสมบัติ :
{
    "sentence": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "ac": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "air_panas": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "bau": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "general": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "kebersihan": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "linen": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "service": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "sunrise_meal": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "tv": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    },
    "wifi": {
        "num_classes": 4,
        "names": [
            "neg",
            "neut",
            "pos",
            "neg_pos"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    }
}

เขียน

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/wrete')
  • คำอธิบาย :
The Wiki Revision Edits Textual Entailment dataset (Setya and Mahendra, 2018) consists of 450 sentence pairs
constructed from Wikipedia revision history. The dataset contains pairs of sentences and binary semantic
relations between the pairs. The data are labeled as entailed when the meaning of the second sentence can be
derived from the first one, and not entailed otherwise.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 100
'train' 300
'validation' 50
  • คุณสมบัติ :
{
    "premise": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "hypothesis": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "category": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "label": {
        "num_classes": 2,
        "names": [
            "NotEntail",
            "Entail_or_Paraphrase"
        ],
        "names_file": null,
        "id": null,
        "_type": "ClassLabel"
    }
}

โพสต์

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/posp')
  • คำอธิบาย :
This Indonesian part-of-speech tagging (POS) dataset (Hoesen and Purwarianti, 2018) is collected from Indonesian
news websites. The dataset consists of around 8000 sentences with 26 POS tags. The POS tag labels follow the
Indonesian Association of Computational Linguistics (INACL) POS Tagging Convention.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 840
'train' 6720
'validation' 840
  • คุณสมบัติ :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 26,
            "names": [
                "B-PPO",
                "B-KUA",
                "B-ADV",
                "B-PRN",
                "B-VBI",
                "B-PAR",
                "B-VBP",
                "B-NNP",
                "B-UNS",
                "B-VBT",
                "B-VBL",
                "B-NNO",
                "B-ADJ",
                "B-PRR",
                "B-PRK",
                "B-CCN",
                "B-$$$",
                "B-ADK",
                "B-ART",
                "B-CSN",
                "B-NUM",
                "B-SYM",
                "B-INT",
                "B-NEG",
                "B-PRI",
                "B-VBE"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

บาโป

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/bapos')
  • คำอธิบาย :
This POS tagging dataset (Dinakaramani et al., 2014) contains about 1000 sentences, collected from the PAN Localization
Project. In this dataset, each word is tagged by one of 23 POS tag classes. Data splitting used in this benchmark follows
the experimental setting used by Kurniawan and Aji (2018)
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 1,029
'train' 8000
'validation' 1,000
  • คุณสมบัติ :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "pos_tags": {
        "feature": {
            "num_classes": 41,
            "names": [
                "B-PR",
                "B-CD",
                "I-PR",
                "B-SYM",
                "B-JJ",
                "B-DT",
                "I-UH",
                "I-NND",
                "B-SC",
                "I-WH",
                "I-IN",
                "I-NNP",
                "I-VB",
                "B-IN",
                "B-NND",
                "I-CD",
                "I-JJ",
                "I-X",
                "B-OD",
                "B-RP",
                "B-RB",
                "B-NNP",
                "I-RB",
                "I-Z",
                "B-CC",
                "B-NEG",
                "B-VB",
                "B-NN",
                "B-MD",
                "B-UH",
                "I-NN",
                "B-PRP",
                "I-SC",
                "B-Z",
                "I-PRP",
                "I-OD",
                "I-SYM",
                "B-WH",
                "B-FW",
                "I-CC",
                "B-X"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

เทอม

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/terma')
  • คำอธิบาย :
This span-extraction dataset is collected from the hotel aggregator platform, AiryRooms (Septiandri and Sutiono, 2019;
Fernando et al., 2019). The dataset consists of thousands of hotel reviews, which each contain a span label for aspect
and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use
Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 1,000
'train' 3000
'validation' 1,000
  • คุณสมบัติ :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "seq_label": {
        "feature": {
            "num_classes": 5,
            "names": [
                "I-SENTIMENT",
                "O",
                "I-ASPECT",
                "B-SENTIMENT",
                "B-ASPECT"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

เก็บ

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/keps')
  • คำอธิบาย :
This keyphrase extraction dataset (Mahfuzh et al., 2019) consists of text from Twitter discussing
banking products and services and is written in the Indonesian language. A phrase containing
important information is considered a keyphrase. Text may contain one or more keyphrases since
important phrases can be located at different positions. The dataset follows the IOB chunking format,
which represents the position of the keyphrase.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 247
'train' 800
'validation' 200
  • คุณสมบัติ :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "seq_label": {
        "feature": {
            "num_classes": 3,
            "names": [
                "O",
                "B",
                "I"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

เนอร์กริต

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/nergrit')
  • คำอธิบาย :
This NER dataset is taken from the Grit-ID repository, and the labels are spans in IOB chunking representation.
The dataset consists of three kinds of named entity tags, PERSON (name of person), PLACE (name of location), and
ORGANIZATION (name of organization).
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 209
'train' 1672
'validation' 209
  • คุณสมบัติ :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 7,
            "names": [
                "I-PERSON",
                "B-ORGANISATION",
                "I-ORGANISATION",
                "B-PLACE",
                "I-PLACE",
                "O",
                "B-PERSON"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

เนิร์ป

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/nerp')
  • คำอธิบาย :
This NER dataset (Hoesen and Purwarianti, 2018) contains texts collected from several Indonesian news websites.
There are five labels available in this dataset, PER (name of person), LOC (name of location), IND (name of product or brand),
EVT (name of the event), and FNB (name of food and beverage). The NERP dataset uses the IOB chunking format.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 840
'train' 6720
'validation' 840
  • คุณสมบัติ :
{
    "tokens": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "ner_tags": {
        "feature": {
            "num_classes": 11,
            "names": [
                "I-PPL",
                "B-EVT",
                "B-PLC",
                "I-IND",
                "B-IND",
                "B-FNB",
                "I-EVT",
                "B-PPL",
                "I-PLC",
                "O",
                "I-FNB"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

คำถามที่พบบ่อย

ใช้คำสั่งต่อไปนี้เพื่อโหลดชุดข้อมูลนี้ใน TFDS:

ds = tfds.load('huggingface:indonlu/facqa')
  • คำอธิบาย :
The goal of the FacQA dataset is to find the answer to a question from a provided short passage from
a news article (Purwarianti et al., 2007). Each row in the FacQA dataset consists of a question,
a short passage, and a label phrase, which can be found inside the corresponding short passage.
There are six categories of questions: date, location, name, organization, person, and quantitative.
  • ใบอนุญาต : ไม่มีใบอนุญาตที่รู้จัก
  • เวอร์ชัน : 1.0.0
  • แยก :
แยก ตัวอย่าง
'test' 311
'train' 2495
'validation' 311
  • คุณสมบัติ :
{
    "question": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "passage": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "seq_label": {
        "feature": {
            "num_classes": 3,
            "names": [
                "O",
                "B",
                "I"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}