  • Description:

This dataset contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Project Gutenberg books project that were published before 1919, along with metadata on book titles and publication dates. PG-19 is over double the size of the Billion Word benchmark and contains documents that are, on average, 20 times longer than those in the WikiText long-range language modelling benchmark.

Books are partitioned into train, validation, and test sets. Book metadata is stored in metadata.csv, which contains the columns (book_id, short_book_title, publication_date, book_link).
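
As a rough illustration of the metadata layout described above, the following sketch reads rows with the four listed columns into a dictionary keyed by book_id. The sample rows, the `read_metadata` helper, and the assumption that the file has no header row are all hypothetical; check the real metadata.csv before relying on any of them.

```python
import csv
import io

# Column order as listed in the description of metadata.csv.
FIELDNAMES = ["book_id", "short_book_title", "publication_date", "book_link"]

# Hypothetical sample rows for illustration only; these are not real PG-19 entries.
SAMPLE_METADATA = """\
101,Example Book One,1850,http://example.org/books/101
102,Example Book Two,1901,http://example.org/books/102
"""


def read_metadata(handle, has_header=False):
    """Return a dict mapping integer book_id to its metadata row.

    Assumption: the file has no header row unless has_header is True.
    """
    fieldnames = None if has_header else FIELDNAMES
    reader = csv.DictReader(handle, fieldnames=fieldnames)
    return {int(row["book_id"]): row for row in reader}


metadata = read_metadata(io.StringIO(SAMPLE_METADATA))
print(metadata[101]["short_book_title"])
```

To read an actual file, pass an open file handle instead of the `io.StringIO` object, for example `open("metadata.csv", newline="", encoding="utf-8")`.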

Split Examples
'test' 100
'train' 28,602
'validation' 50
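
To put the split sizes above in proportion, this small sketch computes each split's share of the total number of books. The counts are copied from the table; the variable names are illustrative only.

```python
# Number of books per split, taken from the table above.
SPLIT_SIZES = {"train": 28602, "validation": 50, "test": 100}

total = sum(SPLIT_SIZES.values())

# Fraction of all books that falls into each split.
fractions = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, count in SPLIT_SIZES.items():
    print(f"{name}: {count} books ({fractions[name]:.2%} of {total})")
```

Nearly all books are in the training split, so the validation and test sets are small in number of documents even though each document is long.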
  • Features:
    'book_id': tf.int32,
    'book_link': tf.string,
    'book_text': Text(shape=(), dtype=tf.string),
    'book_title': tf.string,
    'publication_date': tf.string,
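
The feature list above can be mirrored by a plain Python record, which may be convenient once examples have been converted out of TensorFlow tensors. This is a minimal sketch: `Pg19Record` and `to_record` are hypothetical names, the sample values are invented, and it assumes string features arrive as UTF-8 bytes, as they typically do when string tensors are converted to NumPy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pg19Record:
    """Plain-Python mirror of the five listed features (hypothetical helper type)."""
    book_id: int
    book_link: str
    book_text: str
    book_title: str
    publication_date: str


def _to_str(value):
    # Assumption: string features are UTF-8 encoded bytes; plain str is passed through.
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def to_record(example):
    """Convert a feature dict with the keys listed above into a Pg19Record."""
    return Pg19Record(
        book_id=int(example["book_id"]),
        book_link=_to_str(example["book_link"]),
        book_text=_to_str(example["book_text"]),
        book_title=_to_str(example["book_title"]),
        publication_date=_to_str(example["publication_date"]),
    )


# Invented example values, for illustration only.
raw = {
    "book_id": 101,
    "book_link": b"http://example.org/books/101",
    "book_text": b"Chapter I. It was a quiet morning.",
    "book_title": b"Example Book One",
    "publication_date": b"1850",
}
record = to_record(raw)
print(record.book_title, record.publication_date)
```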
  • Citation:
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
          Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {},
year = {2019},