tfds.download.DownloadManager

Class DownloadManager

Manages the download and extraction of files, as well as caching.

Downloaded files are cached under download_dir. The file name of a downloaded file follows the pattern "{sanitized_url}{content_checksum}.{ext}", e.g. 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.

While a file is being downloaded, it is placed in a temporary directory whose name follows a related pattern that uses a checksum of the URL instead of the content: "{sanitized_url}{url_checksum}.tmp.{uuid}".

Once a file has been downloaded, a "{fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}
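
As a rough illustration of the INFO file described above, the sketch below writes a file with that layout and reads it back using only the standard library. The file name example_download.tar.gz and the helper read_info are hypothetical and not part of the library, and the real manager may store additional fields.

```python
import json
import os
import tempfile


def read_info(download_path):
    """Load the INFO file stored next to a downloaded file (hypothetical helper)."""
    with open(download_path + ".INFO.json") as f:
        return json.load(f)


# Build a stand-in INFO file with the layout shown in the documentation.
tmp_dir = tempfile.mkdtemp()
download_path = os.path.join(tmp_dir, "example_download.tar.gz")  # made-up name
with open(download_path + ".INFO.json", "w") as f:
    json.dump(
        {"dataset_names": ["name1", "name2"],
         "urls": ["http://url.of/downloaded_file"]},
        f,
    )

info = read_info(download_path)
print(info["dataset_names"])
print(info["urls"])
```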

Extracted files/dirs are stored under extract_dir. The file name or directory name is the same as the original name, prefixed with the extraction method. E.g. "{extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".

The methods accept either a plain value or values wrapped in a list or dict. Passing a list or dict parallelizes the downloads.

Example of usage:

# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('https://abc.org/train.tar.gz')
test_dir = dl_manager.download_and_extract('https://abc.org/test.tar.gz')

# Parallel download: list -> list
image_files = dl_manager.download(
    ['https://a.org/1.jpg', 'https://a.org/2.jpg', ...])

# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
    'train': 'https://abc.org/train.zip',
    'test': 'https://abc.org/test.zip',
})
data_dirs['train']  # path to the extracted train data
data_dirs['test']  # path to the extracted test data
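
The str, list and dict behaviour shown above can be pictured with a small sketch. It is not the library's implementation: map_structure and fake_download are hypothetical names, no network access happens, and a thread pool stands in for whatever parallelism the manager actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a real download: returns a made-up local path."""
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


def map_structure(fn, value):
    """Apply fn to a plain value, or to every item of a list or dict.

    Lists and dicts are processed in a thread pool, and the result has the
    same shape (and keys) as the input.
    """
    if isinstance(value, dict):
        keys = list(value)
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(fn, (value[k] for k in keys)))
        return dict(zip(keys, results))
    if isinstance(value, (list, tuple)):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(fn, value))
    return fn(value)  # plain value: a single sequential call


single = map_structure(fake_download, "https://abc.org/train.tar.gz")
as_list = map_structure(
    fake_download, ["https://a.org/1.jpg", "https://a.org/2.jpg"])
as_dict = map_structure(fake_download, {
    "train": "https://abc.org/train.zip",
    "test": "https://abc.org/test.zip",
})
```

The output always mirrors the input: a str yields a str, a list yields a list in the same order, and a dict yields a dict with the same keys.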

To customize the download or extraction further (e.g. passwords, output_name, ...), pass a tfds.download.Resource as the argument.

__init__

__init__(
    download_dir,
    extract_dir=None,
    manual_dir=None,
    manual_dir_instructions=None,
    dataset_name=None,
    force_download=False,
    force_extraction=False,
    register_checksums=False
)

Download manager constructor.

Args:

  • download_dir: str, path to directory where downloads are stored.
  • extract_dir: str, path to directory where artifacts are extracted.
  • manual_dir: str, path to manually downloaded/extracted data directory.
  • manual_dir_instructions: str, human readable instructions on how to prepare contents of the manual_dir for this dataset.
  • dataset_name: str, name of the dataset this instance will be used for. If provided, each download records which datasets it was used for.
  • force_download: bool, defaults to False. If True, always [re]download.
  • force_extraction: bool, defaults to False. If True, always [re]extract.
  • register_checksums: bool, defaults to False. If True, download checksums are not verified but are recorded to a file.

Properties

downloaded_size

Returns the total size of downloaded files.

manual_dir

Returns the directory containing the manually extracted data.

Methods

download

download(url_or_urls)

Downloads the given url(s).

Args:

  • url_or_urls: url or list/dict of urls to download. Each url can be a str or tfds.download.Resource.

Returns:

downloaded_path(s): str, the downloaded path(s), in the same structure as the input url_or_urls.

download_and_extract

download_and_extract(url_or_urls)

Downloads and extracts the given url_or_urls.

This is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

Args:

  • url_or_urls: url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource.

If not explicitly specified in the Resource, the extraction method is deduced automatically from the downloaded file name.

Returns:

extracted_path(s): str, the extracted path(s) for the given URL(s).

download_checksums

download_checksums(checksums_url)

Downloads a checksum file from the given URL and adds it to the registry.

download_kaggle_data

download_kaggle_data(competition_name)

Downloads the data for a given Kaggle competition.

extract

extract(path_or_paths)

Extracts the given path(s).

Args:

  • path_or_paths: path or list/dict of paths of files to extract. Each path can be a str or tfds.download.Resource.

If not explicitly specified in the Resource, the extraction method is deduced from the downloaded file name.

Returns:

extracted_path(s): str, the extracted path(s), in the same structure as the input path_or_paths.
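
To make the idea of deducing the extraction method from the file name concrete, here is a minimal sketch. The suffix table and the function guess_extract_method are assumptions made for illustration. Only the TAR_GZ label appears in the text above; the other labels are made up, and the library's actual rules may differ.

```python
# Hypothetical suffix table; longer suffixes are listed first so that
# ".tar.gz" is matched before ".gz".
_SUFFIX_TO_METHOD = [
    (".tar.gz", "TAR_GZ"),
    (".tgz", "TAR_GZ"),
    (".tar", "TAR"),
    (".zip", "ZIP"),
    (".gz", "GZIP"),
]


def guess_extract_method(file_name):
    """Return a method label for file_name, or None if no suffix matches."""
    lowered = file_name.lower()
    for suffix, method in _SUFFIX_TO_METHOD:
        if lowered.endswith(suffix):
            return method
    return None


for name in ["train.tar.gz", "test.zip", "labels.gz", "image.jpg"]:
    print(name, guess_extract_method(name))
```

Ordering the table from the longest suffix to the shortest keeps a compound extension such as .tar.gz from being mistaken for a plain gzip file.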

iter_archive

iter_archive(resource)

Returns an iterator over the files within an archive.

Important Note: the caller should read files as they are yielded. Reading out of order is slow.

Args:

  • resource: path to archive or tfds.download.Resource.

Returns:

A generator yielding tuples (path_within_archive, file_obj).
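
The advice to read files as they are yielded can be illustrated with the standard-library tarfile module. This is a sketch of the general pattern, not the library's code: iter_tar is a hypothetical helper, and the archive is built in memory purely for the example.

```python
import io
import tarfile


def iter_tar(fileobj):
    """Yield (path_within_archive, file_obj) for each regular file, in order."""
    # Mode "r|*" reads the archive as a forward-only stream, so each member
    # must be consumed before moving on to the next one.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


# Build a small in-memory archive to iterate over.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
        member_info = tarfile.TarInfo(name)
        member_info.size = len(payload)
        tar.addfile(member_info, io.BytesIO(payload))
buffer.seek(0)

# Read each file as soon as it is yielded.
contents = {path: file_obj.read() for path, file_obj in iter_tar(buffer)}
```

Because the archive is consumed as a stream, collecting the file objects first and reading them later would not work here; each one is read inside the loop that yields it.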