conll2002

  • Description:

The CoNLL-2002 shared task concerns language-independent named entity recognition. It covers four types of named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups. Participants were given training and test data for at least two languages and were allowed to use information sources other than the training data.

  • Citation:

@inproceedings{tjong-kim-sang-2002-introduction,
    title = "Introduction to the {C}o{NLL}-2002 Shared Task: Language-Independent Named Entity Recognition",
    author = "Tjong Kim Sang, Erik F.",
    booktitle = "{COLING}-02: The 6th Conference on Natural Language Learning 2002 ({C}o{NLL}-2002)",
    year = "2002",
    url = "https://aclanthology.org/W02-2024",
}

conll2002/es (default config)

  • Download size: 3.95 MiB

  • Dataset size: 3.52 MiB

  • Splits:

Split      Examples
'dev'      1,916
'test'     1,518
'train'    8,324

  • Feature structure:

FeaturesDict({
    'ner': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=9)),
    'pos': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=60)),
    'tokens': Sequence(Text(shape=(), dtype=string)),
})

  • Feature documentation:

Feature    Class                   Shape     Dtype     Description
           FeaturesDict
ner        Sequence(ClassLabel)    (None,)   int64
pos        Sequence(ClassLabel)    (None,)   int64
tokens     Sequence(Text)          (None,)   string
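
As a minimal sketch of how a config listed on this page might be loaded, the snippet below builds the `conll2002/<config>` name and passes it to `tfds.load`. The helper names `builder_name` and `load_split` are hypothetical and introduced only for illustration. The sketch assumes the `tensorflow_datasets` package is installed, and the first call downloads the data.

```python
SUPPORTED_CONFIGS = ("es", "nl")  # the two configs listed on this page


def builder_name(config: str = "es") -> str:
    """Return the TFDS name string for a conll2002 config, e.g. 'conll2002/nl'."""
    if config not in SUPPORTED_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {SUPPORTED_CONFIGS}")
    return f"conll2002/{config}"


def load_split(config: str = "es", split: str = "train"):
    """Load one split ('train', 'dev' or 'test') and return (dataset, info)."""
    # Imported lazily so builder_name() works even without TFDS installed.
    import tensorflow_datasets as tfds

    # with_info=True makes tfds.load return the dataset together with its metadata.
    return tfds.load(builder_name(config), split=split, with_info=True)
```

For example, `ds, info = load_split("es", "dev")` should return the 1,916-example 'dev' split together with its metadata object.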

conll2002/nl

  • Download size: 3.47 MiB

  • Dataset size: 3.55 MiB

  • Splits:

Split      Examples
'dev'      2,896
'test'     5,196
'train'    15,807

  • Feature structure:

FeaturesDict({
    'ner': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=9)),
    'pos': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=12)),
    'tokens': Sequence(Text(shape=(), dtype=string)),
})

  • Feature documentation:

Feature    Class                   Shape     Dtype     Description
           FeaturesDict
ner        Sequence(ClassLabel)    (None,)   int64
pos        Sequence(ClassLabel)    (None,)   int64
tokens     Sequence(Text)          (None,)   string
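
In both configs `ner` is a sequence of integer class ids over 9 classes. That count is consistent with a BIO scheme over the four entity types from the description: B- and I- tags for persons, organizations, locations, and miscellaneous entities, plus O. The sketch below groups such ids into entity spans. The label order in `ASSUMED_NER_NAMES` is an assumption made for illustration, the function name `bio_to_spans` is hypothetical, and the example sentence in the usage note is invented rather than taken from the dataset. In practice the id-to-name mapping should be read from the dataset metadata.

```python
from typing import List, Optional, Sequence, Tuple

# ASSUMPTION: illustrative label order only, not confirmed by this page.
ASSUMED_NER_NAMES = [
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
]


def bio_to_spans(
    tokens: Sequence[str],
    tag_ids: Sequence[int],
    names: Sequence[str] = ASSUMED_NER_NAMES,
) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    for token, tag_id in zip(tokens, tag_ids):
        tag = names[tag_id]
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, etype = tag.split("-", 1)

        # An I- tag continues the open span only if the entity type matches.
        continues = prefix == "I" and etype == current_type

        # Close the open span when the current token does not continue it.
        if current_type is not None and not continues:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

        if prefix == "B" or (prefix == "I" and current_type is None):
            # Start a new span; a stray I- tag is treated as a span start.
            current_type, current_tokens = etype, [token]
        elif continues:
            current_tokens.append(token)

    # Flush a span that runs to the end of the sentence.
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans
```

With the assumed label order, `bio_to_spans(["Maria", "Lopez", "visited", "Amsterdam", "."], [1, 2, 0, 5, 0])` returns `[("PER", "Maria Lopez"), ("LOC", "Amsterdam")]`.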