tfds.download.DownloadManager

Class DownloadManager

Defined in core/download/download_manager.py.

Manages the download and extraction of files, as well as caching.

Downloaded files are cached under download_dir. The name of each downloaded file follows the pattern "${sanitized_url}${content_checksum}.${ext}". E.g. 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.

While a file is being downloaded, it is placed in a temporary directory whose name follows a similar but distinct pattern: "${sanitized_url}${url_checksum}.tmp.${uuid}".

When a file is downloaded, a "${fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}
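As a minimal sketch of how such an INFO file could be inspected, the snippet below writes a stand-in file with the documented layout and reads it back using only the standard library. The file name "downloaded_file.INFO.json" and the temporary directory are assumptions made for illustration; they are not produced by DownloadManager.

```python
import json
import os
import tempfile

# Content copied from the layout documented above.
info = {
    "dataset_names": ["name1", "name2"],
    "urls": ["http://url.of/downloaded_file"],
}

# Hypothetical location: a real INFO file sits next to the downloaded file.
tmp_dir = tempfile.mkdtemp()
info_path = os.path.join(tmp_dir, "downloaded_file.INFO.json")

# Write a stand-in INFO file.
with open(info_path, "w") as f:
    json.dump(info, f)

# Read it back to see which datasets and URLs the download is associated with.
with open(info_path) as f:
    loaded = json.load(f)

dataset_names = loaded["dataset_names"]
urls = loaded["urls"]
```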

Extracted files/dirs are stored under extract_dir. The file name or directory name is the same as the original name, prefixed with the extraction method. E.g. "${extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".

The methods accept either a plain value or values wrapped in a list or dict. Passing a list or dict parallelizes the downloads.

Example of usage:

# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('https://abc.org/train.tar.gz')
test_dir = dl_manager.download_and_extract('https://abc.org/test.tar.gz')

# Parallel download: list -> list
image_files = dl_manager.download(
    ['https://a.org/1.jpg', 'https://a.org/2.jpg', ...])

# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
    'train': 'https://abc.org/train.zip',
    'test': 'https://abc.org/test.zip',
})
data_dirs['train']
data_dirs['test']
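
The input/output convention shown above (str -> str, list -> list, dict -> dict) can be mimicked with a small helper. This is only a sketch under stated assumptions: map_structure_parallel and fake_download are hypothetical names, the thread pool is one possible way to parallelize, and nothing here reflects how DownloadManager is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def map_structure_parallel(fn, value, max_workers=4):
    """Apply fn to a plain value, or to each item of a list or dict."""
    if isinstance(value, dict):
        keys = list(value)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(fn, (value[k] for k in keys)))
        # Same keys as the input, results in place of the values.
        return dict(zip(keys, results))
    if isinstance(value, (list, tuple)):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order.
            return list(pool.map(fn, value))
    # Plain value: nothing to parallelize.
    return fn(value)


def fake_download(url):
    # Stand-in for a real download: returns a made-up local path.
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


single = map_structure_parallel(fake_download, "https://abc.org/train.zip")
as_list = map_structure_parallel(
    fake_download, ["https://a.org/1.jpg", "https://a.org/2.jpg"])
as_dict = map_structure_parallel(fake_download, {
    "train": "https://abc.org/train.zip",
    "test": "https://abc.org/test.zip",
})
```

The output always has the same shape as the input, which is what lets callers index the result with the same keys or positions they passed in.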

To further customize the download or extraction (e.g. passwords, output_name, ...), you can pass a tfds.download.Resource as the argument.

__init__

__init__(
    download_dir,
    extract_dir=None,
    manual_dir=None,
    dataset_name=None,
    checksums=None,
    force_download=False,
    force_extraction=False
)

Download manager constructor.

Args:

  • download_dir: str, path to directory where downloads are stored.
  • extract_dir: str, path to directory where artifacts are extracted.
  • manual_dir: str, path to manually downloaded/extracted data directory.
  • dataset_name: str, name of dataset this instance will be used for. If provided, downloads will contain which datasets they were used for.
  • checksums: dict<str url, str sha256>, mapping from URL to the sha256 of the resource. Only URLs present in the dict are checked. If empty, the checksums of (already) downloaded files are computed and can then be retrieved through the recorded_download_checksums property.
  • force_download: bool, defaults to False. If True, always [re]download.
  • force_extraction: bool, defaults to False. If True, always [re]extract.
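
The checksum behavior described for the checksums argument can be sketched with hashlib. The helpers sha256_of_file and check_download are hypothetical and are not part of tfds; the sketch only illustrates the stated rule that URLs absent from the dict are not checked, while the computed digest is still available to record.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the sha256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_download(url, path, checksums):
    """Return the file's sha256; raise if it differs from checksums[url]."""
    actual = sha256_of_file(path)
    expected = checksums.get(url)
    # URLs missing from the dict are not checked.
    if expected is not None and expected != actual:
        raise ValueError("Checksum mismatch for %s" % url)
    return actual


# Stand-in for a downloaded file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")

url = "https://abc.org/train.zip"
# Empty dict: nothing is verified, but the digest is computed and returned.
recorded = check_download(url, path, checksums={})
# Known checksum: verification passes when the digests match.
verified = check_download(url, path, checksums={url: recorded})
```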

Properties

download_sizes

Returns sizes (in bytes) for downloaded urls.

manual_dir

Returns the directory containing the manually extracted data.

recorded_download_checksums

Returns checksums for downloaded urls.

Methods

download

download(url_or_urls)

Download given url(s).

Args:

  • url_or_urls: url or list/dict of urls to download. Each url can be a str or tfds.download.Resource.

Returns:

downloaded_path(s): str, the downloaded paths, in the same structure as the input url_or_urls.

download_and_extract

download_and_extract(url_or_urls)

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

Args:

  • url_or_urls: url or list/dict of urls to download and extract. Each url can be a str or tfds.download.Resource.

If not explicitly specified in the Resource, the extraction method is deduced automatically from the downloaded file name.

Returns:

extracted_path(s): str, extracted paths of given URL(s).

extract

extract(path_or_paths)

Extract given path(s).

Args:

  • path_or_paths: path or list/dict of paths of files to extract. Each path can be a str or tfds.download.Resource.

If not explicitly specified in the Resource, the extraction method is deduced from the downloaded file name.

Returns:

extracted_path(s): str, the extracted paths, in the same structure as the input path_or_paths.
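
To illustrate what deducing the extraction method from a file name could look like, here is a minimal sketch. The function guess_extract_method and its suffix table are assumptions for illustration; of the method names used, only TAR_GZ appears in this page, and the library's actual rules may differ.

```python
def guess_extract_method(file_name):
    """Guess an extraction method label from a file name's suffix."""
    lowered = file_name.lower()
    # Longer suffixes come first so ".tar.gz" is not mistaken for ".gz".
    suffix_to_method = [
        (".tar.gz", "TAR_GZ"),
        (".tgz", "TAR_GZ"),
        (".tar", "TAR"),
        (".zip", "ZIP"),
        (".gz", "GZIP"),
    ]
    for suffix, method in suffix_to_method:
        if lowered.endswith(suffix):
            return method
    # Hypothetical label for files that need no extraction.
    return "NO_EXTRACT"


method = guess_extract_method("train.tar.gz")
```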

iter_archive

iter_archive(resource)

Returns iterator over files within archive.

Important Note: caller should read files as they are yielded. Reading out of order is slow.

Args:

  • resource: path to the archive, or a tfds.download.Resource.

Returns:

Generator yielding tuple (path_within_archive, file_obj).
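
The note about reading files as they are yielded can be illustrated with the standard tarfile module in streaming mode. This is a sketch of the same access pattern, not the library's implementation: iter_tar is a hypothetical helper, and the in-memory archive exists only to make the example self-contained.

```python
import io
import tarfile


def iter_tar(fileobj):
    """Yield (path_within_archive, file_obj) for each regular file, in order."""
    # Mode "r|*" reads the archive as a forward-only stream, so each
    # file_obj must be read before moving on to the next member.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


# Build a small gzip-compressed tar archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Read every file immediately, in the order it is yielded.
contents = {path: f.read() for path, f in iter_tar(buf)}
```

Because the stream only moves forward, keeping a file object around and reading it after later members have been yielded would not work in this sketch, which is the streaming analogue of the "reading out of order is slow" warning above.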