Loads the federated Stack Overflow dataset.
Downloads and caches the dataset locally. If previously downloaded, tries to load the dataset from cache.
This dataset is derived from the Stack Overflow Data hosted by kaggle.com and available to query through Kernels using the BigQuery API: https://www.kaggle.com/stackoverflow/stackoverflow. The Stack Overflow Data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
The data consists of the body text of all questions and answers. The bodies were parsed into sentences, and any user with fewer than 100 sentences was expunged from the data. Minimal preprocessing was performed as follows:
- Lowercase the text,
- Unescape HTML symbols,
- Remove non-ascii symbols,
- Separate punctuation as individual tokens (except apostrophes and hyphens),
- Remove extraneous whitespace,
- Replace URLs with a special token.
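The preprocessing steps listed above can be sketched roughly as follows. This is an illustration only, not the code used to build the dataset: the `preprocess` function, the `<url>` placeholder, and the URL pattern are all assumptions, since the source does not name the special token or say how URLs were detected. URLs are replaced word by word so that the placeholder itself is not split apart by the punctuation step.

```python
import html
import re

# Hypothetical placeholder; the dataset's actual special token is not given in the source.
URL_TOKEN = "<url>"

# Assumed URL shape: http(s) scheme followed by non-whitespace characters.
_URL_RE = re.compile(r"https?://\S+")
# Any character that is not a word character, whitespace, apostrophe, or hyphen.
# Note: underscores count as word characters here and are left attached.
_PUNCT_RE = re.compile(r"([^\w\s'-])")


def preprocess(text: str) -> str:
    """Applies a rough approximation of the listed preprocessing steps."""
    text = text.lower()                                           # lowercase
    text = html.unescape(text)                                    # unescape HTML symbols
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII
    tokens = []
    for word in text.split():  # splitting on whitespace also removes extra spaces
        if _URL_RE.fullmatch(word):
            tokens.append(URL_TOKEN)
        else:
            # Surround each punctuation mark with spaces, then split into tokens.
            tokens.extend(_PUNCT_RE.sub(r" \1 ", word).split())
    return " ".join(tokens)


print(preprocess("Hello,  World! See https://example.com/a?b=1 &amp; don't re-run."))
```

Apostrophes and hyphens stay attached to their words, so `don't` and `re-run` each remain a single token.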
In addition, the following metadata is available:
- Creation date
- Question title
- Question tags
- Question score
- Type ('question' or 'answer')
The data is divided into three sets:
- Train: Data before 2018-01-01 UTC except the held-out users. 342,477 unique users with 135,818,730 examples.
- Held-out: All examples from users with user_id % 10 == 0 (all dates). 38,758 unique users with 16,491,230 examples.
- Test: All examples after 2018-01-01 UTC except from held-out users. 204,088 unique users with 16,586,035 examples.
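The split rule above can be expressed as a small function. This is a sketch under stated assumptions rather than the dataset's actual partitioning code: `assign_split` is a hypothetical helper, and because the description says "before" and "after" 2018-01-01 without covering the exact boundary, an example created at exactly midnight UTC is assigned to the test set here.

```python
from datetime import datetime, timezone

# Cutoff between the train and test sets, as stated in the description.
CUTOFF = datetime(2018, 1, 1, tzinfo=timezone.utc)


def assign_split(user_id: int, created: datetime) -> str:
    """Returns 'held_out', 'train', or 'test' for one example.

    `created` must be timezone-aware; comparing a naive datetime with
    CUTOFF would raise a TypeError.
    """
    # Held-out users are selected by user ID alone, regardless of date.
    if user_id % 10 == 0:
        return "held_out"
    # Assumption: the boundary instant itself falls on the test side.
    return "train" if created < CUTOFF else "test"


print(assign_split(40, datetime(2015, 5, 1, tzinfo=timezone.utc)))
print(assign_split(41, datetime(2017, 12, 31, tzinfo=timezone.utc)))
print(assign_split(41, datetime(2018, 6, 1, tzinfo=timezone.utc)))
```

Checking the user ID first is what keeps held-out users out of both the train and test sets.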
The returned tf.data.Datasets yield collections.OrderedDict objects at each iteration, with the following entries:
- A tensor with dtype=tf.string containing the date/time of the question or answer in UTC format.
- A tensor with dtype=tf.string containing the title of the question.
- A tensor with dtype=tf.int64 containing the score of the question.
- A tensor with dtype=tf.string containing the tags of the question, separated by '|' characters.
- A tensor with dtype=tf.string containing the tokens of the question/answer, separated by space (' ') characters.
- A tensor with dtype=tf.string containing either the string 'question' or 'answer'.
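The two delimited string fields described above (tags separated by '|' and tokens separated by spaces) can be split back into lists once their values have been pulled out of the dataset. The helpers below are a hypothetical sketch that works on plain Python values and does not depend on TensorFlow. They accept bytes as well as str, because tf.string values typically arrive as bytes after conversion to NumPy.

```python
from typing import List, Union

StrOrBytes = Union[str, bytes]


def _to_text(value: StrOrBytes) -> str:
    """Decodes bytes as UTF-8 and passes str values through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def split_tags(tags: StrOrBytes) -> List[str]:
    """Splits a '|'-separated tag string into a list of tags."""
    text = _to_text(tags)
    return text.split("|") if text else []


def split_tokens(tokens: StrOrBytes) -> List[str]:
    """Splits a space-separated token string into a list of tokens."""
    text = _to_text(tokens)
    return text.split(" ") if text else []


print(split_tags(b"python|pandas|numpy"))
print(split_tokens("how do i sort a list ?"))
```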
cache_dir: (Optional) directory to cache the downloaded file. If None, caches in Keras' default cache directory.
Tuple of (train, held_out, test) where the tuple elements are