
Manages the download and extraction of files, as well as caching.

Downloaded files are cached under download_dir. The file name of each downloaded file follows the pattern "{sanitized_url}{content_checksum}.{ext}". E.g. 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.

While a file is being downloaded, it is placed into a directory following a similar but different pattern: "{sanitized_url}{url_checksum}.tmp.{uuid}".
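
The two naming patterns above can be illustrated with a minimal sketch. The helpers `sanitize_url`, `short_checksum`, `cached_name` and `tmp_dir_name` are hypothetical, and the sanitizing and checksum encoding shown here are assumptions chosen for illustration, not the library's actual algorithm.

```python
import base64
import hashlib
import re
import uuid


def sanitize_url(url, ext):
    """Hypothetical sanitizer: drop the scheme and extension, keep safe characters."""
    no_scheme = re.sub(r"^[a-z]+://", "", url)
    if ext and no_scheme.endswith("." + ext):
        no_scheme = no_scheme[: -(len(ext) + 1)]
    return re.sub(r"[^A-Za-z0-9._-]+", "_", no_scheme)


def short_checksum(data):
    """URL-safe base64 of a SHA-256 digest, without padding (an assumed encoding)."""
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def cached_name(url, content, ext):
    # Final name: "{sanitized_url}{content_checksum}.{ext}"
    return f"{sanitize_url(url, ext)}{short_checksum(content)}.{ext}"


def tmp_dir_name(url, ext):
    # Temporary name: "{sanitized_url}{url_checksum}.tmp.{uuid}"
    url_checksum = short_checksum(url.encode("utf-8"))
    return f"{sanitize_url(url, ext)}{url_checksum}.tmp.{uuid.uuid4().hex}"
```

Note that the final name depends on the file content, while the temporary name can only depend on the URL, since the content is not yet known while the download is in progress.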

When a file is downloaded, a "{fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}
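
As a rough sketch of how such an INFO file could be maintained, the snippet below creates or updates "{fname}.INFO.json" with the standard json module. The helper `write_info` is hypothetical, and merging names across repeated downloads is an assumption rather than documented behavior.

```python
import json
import os
import tempfile


def write_info(fname, dataset_name, url):
    """Create or update "{fname}.INFO.json", merging dataset names and URLs."""
    info_path = fname + ".INFO.json"
    info = {"dataset_names": [], "urls": []}
    if os.path.exists(info_path):
        with open(info_path) as f:
            info = json.load(f)
    if dataset_name not in info["dataset_names"]:
        info["dataset_names"].append(dataset_name)
    if url not in info["urls"]:
        info["urls"].append(url)
    with open(info_path, "w") as f:
        json.dump(info, f)
    return info_path


# Usage: two datasets reuse the same downloaded file.
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "downloaded_file")
    info_path = write_info(fname, "name1", "http://url.of/downloaded_file")
    write_info(fname, "name2", "http://url.of/downloaded_file")
    with open(info_path) as f:
        info = json.load(f)
```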

Extracted files/dirs are stored under extract_dir. The file name or directory name is the same as the original name, prefixed with the extraction method. E.g. "{extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".

The methods accept either a plain value or values wrapped in a list or dict. Passing a list or dict parallelizes the downloads, and the result has the same structure as the input.
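
The structure-preserving behavior can be sketched with a thread pool. This is only an illustration under stated assumptions: `map_structure` and `fake_download` are hypothetical names, and the real manager may parallelize differently.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    # Stand-in for a real download: returns a pretend local path.
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


def map_structure(fn, value, max_workers=4):
    """Apply fn to a plain value, or to each item of a list or dict in parallel."""
    if isinstance(value, dict):
        keys = list(value)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(fn, [value[k] for k in keys]))
        return dict(zip(keys, results))
    if isinstance(value, (list, tuple)):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves the input order.
            return list(pool.map(fn, value))
    # Plain value: nothing to parallelize.
    return fn(value)
```

A str input yields a str, a list yields a list in the same order, and a dict yields a dict with the same keys.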

Example of usage:

# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('')
test_dir = dl_manager.download_and_extract('')

# Parallel download: list -> list
image_files = dl_manager.download(
    ['', '', ...])

# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
   'train': '',
   'test': '',
})

For more customization of the download/extraction (e.g. passwords, output_name, ...), you can pass a tfds.download.Resource as argument.

download_dir Path to directory where downloads are stored.
extract_dir Path to directory where artifacts are extracted.
manual_dir Path to manually downloaded/extracted data directory.
manual_dir_instructions Human readable instructions on how to prepare contents of the manual_dir for this dataset.
url_infos URL infos used for the checksums.
dataset_name Name of the dataset this instance will be used for. If provided, each download's INFO file records which datasets it was used for.
force_download If True, always [re]download.
force_extraction If True, always [re]extract.
force_checksums_validation If True, raises an error if a URL does not have checksums.
register_checksums If True, download checksums are not checked but are stored to a file.
register_checksums_path Path where checksums are saved. Should be set if register_checksums is True.
verify_ssl bool, defaults to True. If True, verifies the SSL certificate when downloading the dataset.
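
The interaction between url_infos, register_checksums and force_checksums_validation can be sketched as follows. The function `check_or_register`, both exception classes, and the `{"size", "checksum"}` record layout are hypothetical and chosen only for illustration.

```python
import hashlib


class MissingChecksumError(Exception):
    """Hypothetical error: no checksum is registered for the URL."""


class ChecksumMismatchError(Exception):
    """Hypothetical error: the downloaded content does not match the record."""


def check_or_register(url, content, url_infos,
                      register_checksums=False,
                      force_checksums_validation=False):
    """Validate content against url_infos, or record it when registering."""
    observed = {"size": len(content),
                "checksum": hashlib.sha256(content).hexdigest()}
    if register_checksums:
        # Registration mode: store the checksum instead of checking it.
        url_infos[url] = observed
        return observed
    expected = url_infos.get(url)
    if expected is None:
        if force_checksums_validation:
            raise MissingChecksumError(f"No checksum registered for {url}")
        return observed  # Nothing to compare against.
    if expected != observed:
        raise ChecksumMismatchError(f"Checksum mismatch for {url}")
    return observed
```

In this sketch, a first run with register_checksums=True fills url_infos, and later runs compare each download against the stored record.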

FileNotFoundError Raised if the register_checksums_path does not exist.


downloaded_size Returns the total size of downloaded files.
manual_dir Returns the directory containing the manually extracted data.
register_checksums Returns whether checksums are being computed and recorded to file.



download

Download given url(s).

url_or_urls url or list/dict of urls to download. Each url can be a str or tfds.download.Resource.

downloaded_path(s): str, The downloaded paths matching the given input url_or_urls.


download_and_extract

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

url_or_urls url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource. If not explicitly specified in Resource, the extraction method is automatically deduced from the downloaded file name.

extracted_path(s): str, extracted paths of given URL(s).
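
Deducing the extraction method from a file name could look like the sketch below. The function `guess_extract_method` is hypothetical; only the "TAR_GZ" label appears in this page, and the other labels and the suffix table are assumptions.

```python
# Ordered so that longer suffixes are tested before shorter ones
# (".tar.gz" must win over ".gz").
_SUFFIX_TO_METHOD = [
    (".tar.gz", "TAR_GZ"),
    (".tgz", "TAR_GZ"),
    (".tar", "TAR"),
    (".zip", "ZIP"),
    (".gz", "GZIP"),
]


def guess_extract_method(fname):
    """Return a method label deduced from the file name, or None if unknown."""
    lowered = fname.lower()
    for suffix, method in _SUFFIX_TO_METHOD:
        if lowered.endswith(suffix):
            return method
    return None
```

Returning None for an unknown suffix leaves the caller free to skip extraction or to require an explicit method.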


download_checksums

Downloads checksum file from the given URL and adds it to registry.


download_kaggle_data

Download data for a given Kaggle Dataset or competition.

competition_or_dataset Dataset name (zillow/zecon) or competition name (titanic)

The path to the downloaded files.


extract

Extract given path(s).

path_or_paths path or list/dict of paths of files to extract. Each path can be a str or tfds.download.Resource. If not explicitly specified in Resource, the extraction method is deduced from the downloaded file name.

extracted_path(s): str, The extracted paths matching the given input path_or_paths.


iter_archive

Returns iterator over files within archive.

Important Note: caller should read files as they are yielded. Reading out of order is slow.

resource path to archive or tfds.download.Resource.

Generator yielding tuple (path_within_archive, file_obj).
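
The read-as-you-go contract can be illustrated with the standard tarfile module opened in streaming mode. This is a sketch under assumptions, not the library's implementation; `iter_tar` is a hypothetical name.

```python
import io
import tarfile


def iter_tar(path_or_fileobj):
    """Yield (path_within_archive, file_obj) for each regular file, in archive order."""
    if hasattr(path_or_fileobj, "read"):
        kwargs = {"fileobj": path_or_fileobj}
    else:
        kwargs = {"name": path_or_fileobj}
    # "r|*" opens the archive as a forward-only stream, so each file object
    # must be read before the generator advances to the next member.
    with tarfile.open(mode="r|*", **kwargs) as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


# Usage: build a small archive in memory, then read it in order.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in [("a.txt", b"alpha"), ("dir/b.txt", b"beta")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)
contents = [(name, f.read()) for name, f in iter_tar(buf)]
```

Each file is consumed inside the loop, before the next tuple is requested, which is the access pattern the note above recommends.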