tapaco

References:

all_languages

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/all_languages')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1926192
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
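
The feature spec above describes flat records, one sentence per row, where `paraphrase_set_id` ties together sentences that paraphrase one another. A minimal sketch of regrouping such rows into paraphrase sets follows. It assumes the records are available as plain Python dicts with the keys listed above; the helper name `group_paraphrases` and the sample rows are invented for illustration and are not taken from the corpus.

```python
from collections import defaultdict


def group_paraphrases(records):
    """Map (language, paraphrase_set_id) to the list of paraphrase strings."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["language"], rec["paraphrase_set_id"])
        groups[key].append(rec["paraphrase"])
    return dict(groups)


# Hypothetical rows that follow the feature spec; not real corpus data.
sample = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "Hello.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "Hi.",
     "lists": [], "tags": [], "language": "en"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "Goodbye.",
     "lists": [], "tags": [], "language": "en"},
]

grouped = group_paraphrases(sample)
for (language, set_id), sentences in grouped.items():
    print(language, set_id, sentences)
```

The key includes the language so that the same grouping also works on the `all_languages` configuration, where rows from different languages are mixed.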

af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/af')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 307
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
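
The description states that the corpus is built by populating a graph with sentences and equivalence links and then traversing it to extract paraphrase sets. The sketch below illustrates that general idea only: it treats connected sentences as equivalent and groups them by language, using a union-find structure. It is a simplification under stated assumptions, not the authors' pipeline; the filtering and pruning steps mentioned in the description are not reproduced, and the function name, input shapes, and sample data are hypothetical.

```python
from collections import defaultdict


def paraphrase_sets(sentences, links):
    """Extract same-language sets of sentences connected by equivalence links.

    sentences: dict mapping sentence id -> (language, text)
    links: iterable of (id_a, id_b) pairs meaning "same meaning"
    """
    parent = {sid: sid for sid in sentences}

    def find(x):
        # Follow parent pointers to the component root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Sentences in the same component and the same language form one set.
    by_component = defaultdict(list)
    for sid, (language, text) in sentences.items():
        by_component[(find(sid), language)].append(text)

    # A set needs at least two sentences to contain a paraphrase pair.
    return [sorted(texts) for texts in by_component.values() if len(texts) > 1]


# Hypothetical toy data; not taken from Tatoeba.
toy_sentences = {
    1: ("en", "Hi."),
    2: ("fr", "Salut."),
    3: ("en", "Hello."),
    4: ("en", "Bye."),
}
toy_links = [(1, 2), (2, 3)]

result = paraphrase_sets(toy_sentences, toy_links)
print(result)
```

Here the two English sentences end up in one set because both are linked to the same French sentence, while the unlinked sentence and the single French sentence yield no set.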

ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ar')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 6446
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/az')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 624
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/be')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1512
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ber

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ber')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 67484
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/bg')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 6324
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/bn')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 1440
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/br')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 2536
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/ca')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 518
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/cbk')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 262
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cmn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/cmn')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 12549
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/cs')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 6659
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/da')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 11220
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/de')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License: Creative Commons Attribution 2.0 Generic
  • Version: 1.0.0
  • Splits:
Split Examples
'train' 125091
  • Features:
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/el')
  • Description:
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 10072
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
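
As a rough illustration of how the `paraphrase_set_id` and `paraphrase` features listed above might be used together, the sketch below groups rows into paraphrase sets. It assumes that rows sharing a `paraphrase_set_id` are paraphrases of one another, which the field name suggests but this page does not state outright. The helper `group_paraphrases` and the sample rows are hypothetical placeholders, not part of TFDS and not real corpus data.

```python
from collections import defaultdict


def group_paraphrases(rows):
    """Collect paraphrase strings under their shared paraphrase_set_id.

    `rows` is any iterable of dicts carrying the features listed above.
    This helper is illustrative only; it is not provided by TFDS.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["paraphrase_set_id"]].append(row["paraphrase"])
    return dict(groups)


# Made-up placeholder rows shaped like the feature dictionary above.
sample_rows = [
    {"paraphrase_set_id": "1", "sentence_id": "10", "paraphrase": "sentence A",
     "lists": [], "tags": [], "language": "el"},
    {"paraphrase_set_id": "1", "sentence_id": "11", "paraphrase": "sentence B",
     "lists": [], "tags": [], "language": "el"},
    {"paraphrase_set_id": "2", "sentence_id": "12", "paraphrase": "sentence C",
     "lists": [], "tags": [], "language": "el"},
]

grouped = group_paraphrases(sample_rows)
print(grouped)
```

With real data, the same function could be fed rows decoded from the loaded dataset, provided string features are first converted from bytes to `str`.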

en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/en')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 158053
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
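
Since the description says the corpus is organized as sets of paraphrases, a common next step is to expand one set into sentence pairs, for example to build training pairs. The following is a minimal sketch under that assumption. `paraphrase_pairs` is a hypothetical helper, and the example sentences are invented for illustration rather than taken from the corpus.

```python
from itertools import combinations


def paraphrase_pairs(sentences):
    """Return every unordered pair of sentences from one paraphrase set.

    Illustrative helper only; a set of n sentences yields n * (n - 1) / 2 pairs.
    """
    return list(combinations(sentences, 2))


# Invented example set, not real corpus data.
example_set = ["I am tired.", "I'm tired.", "I feel tired."]

pairs = paraphrase_pairs(example_set)
for first, second in pairs:
    print(first, "<->", second)
```

Because the number of pairs grows quadratically with the set size, very large sets may need to be subsampled before pairing.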

eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/eo')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 207105
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/es')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 85064
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/et')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 241
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/eu')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 573
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/fi')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 31753
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/fr')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 116733
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/gl')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a
crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular
linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links
between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent
filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows
that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or
near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 to 250 000
sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
  • License : Creative Commons Attribution 2.0 Generic
  • Version : 1.0.0
  • Splits :
Split Examples
'train' 351
  • Features :
{
    "paraphrase_set_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "sentence_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "paraphrase": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "lists": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "tags": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "language": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

gos

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:tapaco/gos')
  • Description :
A freely available paraphrase corpus for 73 languages extracted from the Tatoeba