
tfds.download.DownloadManager

Manages the download and extraction of files, as well as caching.

Downloaded files are cached under download_dir. The file names of downloaded files follow the pattern "{sanitized_url}{content_checksum}.{ext}". E.g. 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.

While a file is being downloaded, it is placed into a directory following a similar but different pattern: "{sanitized_url}{url_checksum}.tmp.{uuid}".

When a file is downloaded, a "{fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}

Extracted files/dirs are stored under extract_dir. The file name or directory name is the same as the original name, prefixed with the extraction method. E.g. "{extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".

The methods accept either a plain value or values wrapped in a list or dict. Passing a data structure parallelizes the downloads.

Example of usage:

# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('https://abc.org/train.tar.gz')
test_dir = dl_manager.download_and_extract('https://abc.org/test.tar.gz')

# Parallel download: list -> list
image_files = dl_manager.download(
    ['https://a.org/1.jpg', 'https://a.org/2.jpg', ...])

# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
   'train': 'https://abc.org/train.zip',
   'test': 'https://abc.org/test.zip',
})
data_dirs['train']
data_dirs['test']

For more customization of the download/extraction (e.g. passwords, output_name, ...), you can pass a tfds.download.Resource as an argument, as sketched below.
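
For instance, a rough sketch (the URL below is a placeholder) that forces a specific extraction method instead of relying on the file-name based deduction:

# Placeholder URL; the archive is a gzipped tarball whose name does not reveal it.
resource = tfds.download.Resource(
    url='https://abc.org/data.bin',
    extract_method=tfds.download.ExtractMethod.TAR_GZ,
)
data_dir = dl_manager.download_and_extract(resource)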

Args

download_dir Path to the directory where downloads are stored.
extract_dir Path to the directory where artifacts are extracted.
manual_dir Path to the manually downloaded/extracted data directory.
manual_dir_instructions Human-readable instructions on how to prepare the contents of manual_dir for this dataset.
url_infos URL info for the checksums.
dataset_name Name of the dataset this instance will be used for. If provided, the downloads' INFO files record which datasets they were used for.
force_download If True, always [re]download.
force_extraction If True, always [re]extract.
force_checksums_validation If True, raises an error if a URL does not have registered checksums.
register_checksums If True, downloaded checksums aren't validated, but are recorded to file.
register_checksums_path Path where to save the checksums. Should be set if register_checksums is True.
verify_ssl bool, defaults to True. If True, verifies the certificate when downloading the dataset.

Raises

FileNotFoundError Raised if register_checksums_path does not exist.

Attributes

download_dir Returns the directory where downloads are stored.
downloaded_size Returns the total size of downloaded files.
manual_dir Returns the directory containing the manually extracted data.
register_checksums Returns whether checksums are being computed and recorded to file.
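
In typical use the manager is created by tfds and handed to the builder's _split_generators, so manual construction is rarely needed; a minimal sketch (all paths and names below are placeholders) would look like:

dl_manager = tfds.download.DownloadManager(
    download_dir='/path/to/downloads',   # placeholder path
    extract_dir='/path/to/extracted',    # placeholder path
    dataset_name='my_dataset',           # placeholder name
)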

Methods

download

Download given url(s).

Args
url_or_urls url or list/dict of urls to download. Each url can be a str or tfds.download.Resource.

Returns
downloaded_path(s): str, The downloaded paths matching the given input url_or_urls.

download_and_extract

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

Args
url_or_urls url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource. If not explicitly specified in the Resource, the extraction method is automatically deduced from the downloaded file name.

Returns
extracted_path(s): str, extracted paths of given URL(s).

download_checksums

Downloads the checksum file from the given URL and adds it to the registry.
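
A hedged one-liner (the URL is a placeholder, and this assumes the checksum file's URL is the single argument):

dl_manager.download_checksums('https://abc.org/checksums.tsv')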

download_kaggle_data

Download data for a given Kaggle Dataset or competition.

Args
competition_or_dataset Dataset name (zillow/zecon) or competition name (titanic)

Returns
The path to the downloaded files.
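
For example, using the competition name mentioned above (this assumes Kaggle API credentials are configured in the environment):

titanic_path = dl_manager.download_kaggle_data('titanic')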

extract

Extract given path(s).

Args
path_or_paths path or list/dict of paths of files to extract. Each path can be a str or tfds.download.Resource. If not explicitly specified in the Resource, the extraction method is deduced from the downloaded file name.

Returns
extracted_path(s): str, The extracted paths matching the given input path_or_paths.
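
For example, downloading and extracting in two explicit steps (URLs are placeholders), mirroring the equivalence noted under download_and_extract:

archive_paths = dl_manager.download({
    'train': 'https://abc.org/train.zip',
    'test': 'https://abc.org/test.zip',
})
extracted_dirs = dl_manager.extract(archive_paths)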

iter_archive

Returns an iterator over the files within the archive.

Important note: the caller should read files as they are yielded. Reading out of order is slow.

Args
resource Path to the archive, or a tfds.download.Resource.

Returns
Generator yielding tuple (path_within_archive, file_obj).
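
For example, a rough sketch (the URL and file filter are placeholders) that streams files out of a downloaded archive without extracting it to disk:

archive_path = dl_manager.download('https://abc.org/train.tar.gz')
for fname, fobj in dl_manager.iter_archive(archive_path):
    if fname.endswith('.jpg'):
        image_bytes = fobj.read()  # read each file as it is yielded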