reddit

Description:

This dataset contains preprocessed posts from the Reddit dataset. It consists of 3,848,330 posts, with an average length of 270 words for the content and 28 words for the summary.

All features are strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. The content feature serves as the document and the summary feature as its summary.

Additional documentation: Explore on Papers With Code
Homepage: https://github.com/webis-de/webis-tldr-17-corpus
Source code: tfds.datasets.reddit.Builder
Versions:
- 1.0.0 (default): No release notes.
Download size: 2.93 GiB
Dataset size: 18.09 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,848,330
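
Since the table lists only a `'train'` split, any held-out set has to be carved out by the user. The following is a minimal sketch, assuming the TFDS split-slicing syntax (for example `train[:95%]`). The helper names `train_val_splits` and `load_train_val` are hypothetical, and the 5% validation share is an arbitrary choice rather than anything the dataset prescribes.

```python
from typing import Tuple


def train_val_splits(val_percent: int = 5) -> Tuple[str, str]:
    """Build TFDS slicing strings that partition 'train' into two parts."""
    if not 0 < val_percent < 100:
        raise ValueError("val_percent must be between 1 and 99")
    cut = 100 - val_percent
    return f"train[:{cut}%]", f"train[{cut}%:]"


def load_train_val(val_percent: int = 5):
    """Load both slices; the first call downloads and prepares the dataset."""
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    train_split, val_split = train_val_splits(val_percent)
    # Passing a list of splits returns a list of tf.data.Dataset objects.
    return tfds.load("reddit", split=[train_split, val_split])
```

Calling `load_train_val()` would return the two datasets in the same order as the split strings. Keeping the string construction separate from the loading makes the slicing easy to check without touching the data.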

Feature structure:

FeaturesDict({
    'author': string,
    'body': string,
    'content': string,
    'id': string,
    'normalizedBody': string,
    'subreddit': string,
    'subreddit_id': string,
    'summary': string,
})
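
When examples are iterated as NumPy values (for instance through `tfds.as_numpy`), scalar string features arrive as `bytes` rather than `str`. The sketch below decodes one such example into plain Python strings. `decode_example` is a hypothetical helper, UTF-8 is assumed as the encoding, and the sample record is made up for illustration rather than taken from the dataset.

```python
from typing import Dict, Union

# Feature names copied from the FeaturesDict above.
FEATURE_KEYS = (
    "author", "body", "content", "id",
    "normalizedBody", "subreddit", "subreddit_id", "summary",
)


def decode_example(example: Dict[str, Union[bytes, str]]) -> Dict[str, str]:
    """Return a copy of `example` with every feature decoded to `str`."""
    missing = [key for key in FEATURE_KEYS if key not in example]
    if missing:
        raise KeyError(f"example is missing features: {missing}")
    decoded = {}
    for key in FEATURE_KEYS:
        value = example[key]
        # Assumption: byte strings are UTF-8 encoded.
        decoded[key] = value.decode("utf-8") if isinstance(value, bytes) else value
    return decoded


# Made-up record shaped like the FeaturesDict; not real dataset content.
sample = {key: f"<{key}>".encode("utf-8") for key in FEATURE_KEYS}
decoded = decode_example(sample)
```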

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
author	Tensor	string
body	Tensor	string
content	Tensor	string
id	Tensor	string
normalizedBody	Tensor	string
subreddit	Tensor	string
subreddit_id	Tensor	string
summary	Tensor	string

Supervised keys (See as_supervised doc): ('content', 'summary')
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):
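
A minimal sketch of how the supervised keys and the `tfds.as_dataframe` preview might be used in practice. `to_supervised` is a hypothetical pure-Python helper that mirrors what `as_supervised=True` does with the `('content', 'summary')` keys, and `preview` is a hypothetical wrapper around `tfds.load` and `tfds.as_dataframe`. The default of four preview rows is an arbitrary choice.

```python
from typing import Any, Dict, Tuple

# Supervised keys as listed on this page.
SUPERVISED_KEYS = ("content", "summary")


def to_supervised(
    example: Dict[str, Any], keys: Tuple[str, str] = SUPERVISED_KEYS
) -> Tuple[Any, Any]:
    """Mimic as_supervised=True: pick the (input, target) pair from a dict."""
    input_key, target_key = keys
    return example[input_key], example[target_key]


def preview(n: int = 4):
    """Load `n` training examples and return them as a pandas DataFrame."""
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    ds, info = tfds.load("reddit", split="train", with_info=True)
    return tfds.as_dataframe(ds.take(n), info)
```

Passing `as_supervised=True` to `tfds.load` yields the same `(content, summary)` pairs directly, which is usually the more convenient form for training a summarization model.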

Citation:

@inproceedings{volske-etal-2017-tl,
    title = "{TL};{DR}: Mining {R}eddit to Learn Automatic Summarization",
    author = {V{\"o}lske, Michael  and
      Potthast, Martin  and
      Syed, Shahbaz  and
      Stein, Benno},
    booktitle = "Proceedings of the Workshop on New Frontiers in Summarization",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W17-4508",
    doi = "10.18653/v1/W17-4508",
    pages = "59--63",
    abstract = "Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a {``}TL;DR{''} to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 dataset, complementing existing corpora primarily from the news genre. Our technique is likely applicable to other social media sites and general web crawls.",
}
