
Manages the download and extraction of files, as well as caching.

Downloaded files are cached under download_dir. The file name of a downloaded file follows the pattern "{sanitized_url}{content_checksum}.{ext}", e.g. 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.
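
To make the naming pattern concrete, here is a minimal sketch of how a name of that shape could be assembled. The helper name, the sanitization rule, the URL, and the choice of URL-safe base64 over a SHA-256 digest are all assumptions for illustration, not the library's actual implementation.

```python
import base64
import hashlib
import re


def sketch_download_filename(url: str, content: bytes) -> str:
    """Hypothetical helper: builds a name shaped like {sanitized_url}{content_checksum}.{ext}."""
    # Drop the scheme, then split off a trailing extension (".tar.gz" kept whole).
    without_scheme = re.sub(r"^[a-z]+://", "", url)
    match = re.search(r"(\.tar\.gz|\.[A-Za-z0-9]+)$", without_scheme)
    ext = match.group(1) if match else ""
    stem = without_scheme[: len(without_scheme) - len(ext)]
    # Assumed sanitization: anything outside [A-Za-z0-9.-] becomes an underscore.
    sanitized = re.sub(r"[^A-Za-z0-9.\-]+", "_", stem).strip("_")
    # Assumed checksum encoding: URL-safe base64 of the SHA-256 digest, padding removed.
    digest = hashlib.sha256(content).digest()
    checksum = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return f"{sanitized}{checksum}{ext}"


# Example with a made-up URL and payload.
name = sketch_download_filename("https://example.org/data/train.tar.gz", b"payload")
```

Because the checksum is derived from the content, two different files served from the same URL would end up under different names in this sketch.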

While a file is being downloaded, it is placed into a directory following a similar but different pattern: "{sanitized_url}{url_checksum}.tmp.{uuid}".

When a file is downloaded, a "{fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}
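
The sidecar file can be illustrated with a short sketch that writes and then reads back a "{fname}.INFO.json" file with the two keys shown above. The helper name and the merge behavior (appending a second dataset name to an existing sidecar) are assumptions made for the example.

```python
import json
import os
import tempfile


def write_info_sidecar(file_path: str, dataset_name: str, url: str) -> str:
    """Hypothetical helper: records which dataset and URL a downloaded file came from."""
    info_path = f"{file_path}.INFO.json"
    info = {"dataset_names": [], "urls": []}
    # Assumed behavior: merge with an existing sidecar so datasets can share a file.
    if os.path.exists(info_path):
        with open(info_path) as f:
            info = json.load(f)
    if dataset_name not in info["dataset_names"]:
        info["dataset_names"].append(dataset_name)
    if url not in info["urls"]:
        info["urls"].append(url)
    with open(info_path, "w") as f:
        json.dump(info, f)
    return info_path


with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "downloaded_file")
    write_info_sidecar(fname, "name1", "http://url.of/downloaded_file")
    info_path = write_info_sidecar(fname, "name2", "http://url.of/downloaded_file")
    with open(info_path) as f:
        sidecar = json.load(f)
```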

Extracted files/dirs are stored under extract_dir. The file name or directory name is the same as the original name, prefixed with the extraction method. E.g. "{extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".
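
As a rough illustration of the prefixing rule, the sketch below picks a method label from the file suffix and builds the target path. Only the TAR_GZ label appears in the text above; the other labels, the suffix table, and the helper name are assumptions.

```python
import os

# Assumed suffix-to-label table; longer suffixes come first so ".tar.gz" wins over ".gz".
_SUFFIX_TO_METHOD = [
    (".tar.gz", "TAR_GZ"),
    (".tar", "TAR"),
    (".zip", "ZIP"),
    (".gz", "GZIP"),
]


def sketch_extract_path(extract_dir: str, downloaded_name: str) -> str:
    """Hypothetical helper: returns "{extract_dir}/{METHOD}.{downloaded_name}"."""
    for suffix, method in _SUFFIX_TO_METHOD:
        if downloaded_name.endswith(suffix):
            return os.path.join(extract_dir, f"{method}.{downloaded_name}")
    raise ValueError(f"Cannot deduce extraction method from {downloaded_name!r}")


target = sketch_extract_path("extract_dir", "example.org_data_trainABC.tar.gz")
```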

The member functions accept either a plain value or values wrapped in a list or dict. Passing a list or dict parallelizes the downloads.
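
The "same shape in, same shape out" behavior can be sketched with a small helper that applies a function to a plain value, a list, or a dict, using a thread pool for the container cases. This is a simplified stand-in under stated assumptions, not the library's code: map_structure_parallel and fake_download are hypothetical names, and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def map_structure_parallel(fn, value, max_workers=4):
    """Applies fn to a plain value, or to each item of a list or dict, keeping the shape."""
    if isinstance(value, dict):
        keys = list(value)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map returns results in input order, so keys and results line up.
            results = list(pool.map(fn, (value[k] for k in keys)))
        return dict(zip(keys, results))
    if isinstance(value, (list, tuple)):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fn, value))
    # Plain value: no parallelism needed.
    return fn(value)


def fake_download(url):
    # Stand-in for a real download; returns a made-up local path.
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


single = map_structure_parallel(fake_download, "https://example.org/a.zip")
as_list = map_structure_parallel(
    fake_download, ["https://example.org/1.jpg", "https://example.org/2.jpg"])
as_dict = map_structure_parallel(
    fake_download, {"train": "https://example.org/train.zip"})
```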

Example of usage:

# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('')
test_dir = dl_manager.download_and_extract('')

# Parallel download: list -> list
image_files = dl_manager.download(
    ['', '', ...])

# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
   'train': '',
   'test': '',
})

For more customization of the download/extraction (e.g. passwords, output_name, ...), you can pass a tfds.download.Resource as argument.

download_dir: Path to the directory where downloads are stored.
extract_dir: Path to the directory where artifacts are extracted.
manual_dir: Path to the manually downloaded/extracted data directory.
manual_dir_instructions: Human-readable instructions on how to prepare the contents of manual_dir for this dataset.
url_infos: URL infos for the checksums.
dataset_name: Name of the dataset this instance will be used for. If provided, downloads will record which datasets they were used for.
force_download: If True, always [re]download.
force_extraction: If True, always [re]extract.
force_checksums_validation: If True, raises an error if a URL does not have checksums.
register_checksums: If True, download checksums aren't checked, but stored into a file.
register_checksums_path: Path where to save checksums. Should be set if register_checksums is True.
verify_ssl: bool, defaults to True. If True, verifies the certificate when downloading the dataset.
max_simultaneous_downloads: int, optional maximum number of simultaneous downloads.
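
How the checksum-related flags interact can be sketched roughly as follows. The helper name, the use of a SHA-256 hex digest, the plain dict standing in for url_infos, and the ValueError exceptions are all assumptions chosen for illustration; the real class may record and report checksums differently.

```python
import hashlib


def check_or_register(url, content, known, register=False, force_validation=False):
    """Hypothetical helper mirroring register_checksums and force_checksums_validation."""
    digest = hashlib.sha256(content).hexdigest()
    if register:
        # Registration mode: store the checksum instead of checking it.
        known[url] = digest
        return digest
    if url not in known:
        # No recorded checksum: only an error when validation is forced.
        if force_validation:
            raise ValueError(f"No checksum recorded for {url}")
        return digest
    if known[url] != digest:
        raise ValueError(f"Checksum mismatch for {url}")
    return digest


# Made-up URL and payload: register first, then validate against the record.
known_checksums = {}
recorded = check_or_register(
    "https://example.org/a.zip", b"abc", known_checksums, register=True)
validated = check_or_register("https://example.org/a.zip", b"abc", known_checksums)
```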

FileNotFoundError: Raised if the register_checksums_path does not exist.


downloaded_size: Returns the total size of downloaded files.
manual_dir: Returns the directory containing the manually extracted data.
register_checksums: Returns whether checksums are being computed and recorded to file.




Download given url(s).

url_or_urls: url or list/dict of urls to download. Each url can be a str or tfds.download.Resource.

downloaded_path(s): str, The downloaded paths matching the given input url_or_urls.



Download and extract given url_or_urls.

This is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

url_or_urls: url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource. If not explicitly specified in Resource, the extraction method is deduced automatically from the downloaded file name.

extracted_path(s): str, extracted paths of given URL(s).



Downloads checksum file from the given URL and adds it to registry.



Download data for a given Kaggle Dataset or competition.

competition_or_dataset: Dataset name (zillow/zecon) or competition name (titanic).

The path to the downloaded files.



Extract given path(s).

path_or_paths: path or list/dict of paths of files to extract. Each path can be a str or tfds.download.Resource. If not explicitly specified in Resource, the extraction method is deduced from the downloaded file name.

extracted_path(s): str, The extracted paths matching the given input path_or_paths.



Returns iterator over files within archive.

Important Note: caller should read files as they are yielded. Reading out of order is slow.

resource: path to archive or tfds.download.Resource.

Generator yielding tuple (path_within_archive, file_obj).