reddit

Description:

This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts, with an average length of 270 words for the content and 28 words for the summary.

The features are strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. The content field is used as the document and the summary field is used as the summary.
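The average lengths quoted above can be illustrated with a small sketch. It assumes that a "word" is a whitespace-delimited token, which the description does not confirm, and it runs on two made-up records rather than on the corpus itself. The helper name `average_word_count` is hypothetical.

```python
from statistics import mean

# Hypothetical records that mimic the dataset's string features.
# The text is invented for illustration and is not taken from the corpus.
records = [
    {
        "content": "I spent all weekend fixing my bike and it still squeaks",
        "summary": "bike still squeaks",
    },
    {
        "content": "Moved to a new city last month and found a great cafe",
        "summary": "found a cafe",
    },
]


def average_word_count(rows, key):
    """Return the mean number of whitespace-delimited words in rows[key]."""
    return mean(len(row[key].split()) for row in rows)


avg_content = average_word_count(records, "content")
avg_summary = average_word_count(records, "summary")
print(avg_content, avg_summary)
```

Applied to the real `'train'` split, the same helper should give figures close to the reported averages only if the original counts used a comparable tokenization.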

Additional documentation: Explore on Papers With Code
Homepage: https://github.com/webis-de/webis-tldr-17-corpus
Source code: tfds.datasets.reddit.Builder
Versions:
- 1.0.0 (default): No release notes.
Download size: 2.93 GiB
Dataset size: 18.09 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,848,330

Feature structure:

FeaturesDict({
    'author': string,
    'body': string,
    'content': string,
    'id': string,
    'normalizedBody': string,
    'subreddit': string,
    'subreddit_id': string,
    'summary': string,
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
author	Tensor	string
body	Tensor	string
content	Tensor	string
id	Tensor	string
normalizedBody	Tensor	string
subreddit	Tensor	string
subreddit_id	Tensor	string
summary	Tensor	string

Supervised keys (see the as_supervised doc): ('content', 'summary')
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):
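A minimal sketch of how the supervised keys and the dataframe preview might be used, assuming the `tensorflow_datasets` package is installed. The helper names `to_supervised`, `load_reddit_pairs`, and `preview_dataframe` are hypothetical. `to_supervised` only mimics in plain Python what `as_supervised=True` returns, and the sample record is invented. The TFDS imports are deferred into the loader functions because calling them starts the 2.93 GiB download.

```python
def to_supervised(example, keys=("content", "summary")):
    """Mimic as_supervised=True: keep only the (input, target) pair."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


def load_reddit_pairs(split="train[:1%]"):
    """Load (content, summary) pairs from a slice of the 'train' split."""
    # Deferred import: defining this helper does not require TFDS.
    import tensorflow_datasets as tfds

    return tfds.load("reddit", split=split, as_supervised=True)


def preview_dataframe(num_rows=4):
    """Return a pandas DataFrame holding the first few examples."""
    import tensorflow_datasets as tfds

    ds, info = tfds.load("reddit", split="train", with_info=True)
    return tfds.as_dataframe(ds.take(num_rows), info)


# Hypothetical record with the eight string features listed above.
sample = {
    "author": "user_a",
    "body": "Long post text. TL;DR: short version",
    "content": "Long post text.",
    "id": "abc123",
    "normalizedBody": "Long post text. TL;DR: short version",
    "subreddit": "example",
    "subreddit_id": "t5_xxxxx",
    "summary": "short version",
}
document, summary = to_supervised(sample)
```

The slice `train[:1%]` limits how many examples are read, but the full archive is still downloaded and prepared the first time the dataset is loaded.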

Citation:

@inproceedings{volske-etal-2017-tl,
    title = "{TL};{DR}: Mining {R}eddit to Learn Automatic Summarization",
    author = {V{\"o}lske, Michael  and
      Potthast, Martin  and
      Syed, Shahbaz  and
      Stein, Benno},
    booktitle = "Proceedings of the Workshop on New Frontiers in Summarization",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4508",
    doi = "10.18653/v1/W17-4508",
    pages = "59--63",
    abstract = "Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a {``}TL;DR{''} to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.",
}
