tfds.download.DownloadManager

Manages the download and extraction of files, as well as caching.

tfds.download.DownloadManager(
    *,
    download_dir: epath.PathLike,
    extract_dir: Optional[epath.PathLike] = None,
    manual_dir: Optional[epath.PathLike] = None,
    manual_dir_instructions: Optional[str] = None,
    url_infos: Optional[Dict[str, checksums.UrlInfo]] = None,
    dataset_name: Optional[str] = None,
    force_download: bool = False,
    force_extraction: bool = False,
    force_checksums_validation: bool = False,
    register_checksums: bool = False,
    register_checksums_path: Optional[epath.PathLike] = None,
    verify_ssl: bool = True,
    max_simultaneous_downloads: Optional[int] = None
)

Downloaded files are cached under download_dir. Downloaded file names follow the pattern "{sanitized_url}{content_checksum}.{ext}". E.g. 'cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz'.

While a file is being downloaded, it is placed into a directory following a similar but different pattern: "{sanitized_url}{url_checksum}.tmp.{uuid}".

When a file is downloaded, a "{fname}.INFO.json" file is created next to it. This INFO file contains the following information: {"dataset_names": ["name1", "name2"], "urls": ["http://url.of/downloaded_file"]}
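As an illustration of the sidecar file described above, here is a minimal sketch that reads a "{fname}.INFO.json" file with the standard library. The helper `read_info` and the file name `example_file.tar.gz` are hypothetical and not part of TFDS. The JSON content is the example given above, and real INFO files may contain additional fields.

```python
import json
import tempfile
from pathlib import Path


def read_info(downloaded_file: Path) -> dict:
    """Load the INFO file stored next to a downloaded file (hypothetical helper)."""
    info_path = downloaded_file.with_name(downloaded_file.name + ".INFO.json")
    return json.loads(info_path.read_text())


# Build a stand-in download directory so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
downloaded = tmp_dir / "example_file.tar.gz"  # hypothetical file name
downloaded.write_bytes(b"")
(tmp_dir / "example_file.tar.gz.INFO.json").write_text(json.dumps({
    "dataset_names": ["name1", "name2"],
    "urls": ["http://url.of/downloaded_file"],
}))

info = read_info(downloaded)
print(info["dataset_names"])
print(info["urls"])
```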

Extracted files/dirs are stored under extract_dir. The file name or directory name is the same as the original name, prefixed with the extraction method. E.g. "{extract_dir}/TAR_GZ.cs.toronto.edu_kriz_cifar-100-pythonJDF[...]I.tar.gz".

The methods accept either a plain value or values wrapped in a list or dict. Passing a list or dict parallelizes the downloads.

Example of usage:

# Sequential download: str -> str
train_dir = dl_manager.download_and_extract('https://abc.org/train.tar.gz')
test_dir = dl_manager.download_and_extract('https://abc.org/test.tar.gz')

# Parallel download: list -> list
image_files = dl_manager.download(
    ['https://a.org/1.jpg', 'https://a.org/2.jpg', ...])

# Parallel download: dict -> dict
data_dirs = dl_manager.download_and_extract({
   'train': 'https://abc.org/train.zip',
   'test': 'https://abc.org/test.zip',
})
data_dirs['train']
data_dirs['test']
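The "same structure in, same structure out" behaviour in the example above can be sketched in plain Python. This is only an illustration of the input/output shapes, not how TFDS implements it: `map_urls` and `fake_download` are hypothetical stand-ins, nothing is actually downloaded, and the sketch runs sequentially rather than in parallel.

```python
from typing import Any, Callable


def map_urls(fn: Callable[[str], str], url_or_urls: Any) -> Any:
    """Apply `fn` to a single URL, or to each URL in a list or dict."""
    if isinstance(url_or_urls, dict):
        return {key: map_urls(fn, value) for key, value in url_or_urls.items()}
    if isinstance(url_or_urls, (list, tuple)):
        return [map_urls(fn, value) for value in url_or_urls]
    return fn(url_or_urls)


def fake_download(url: str) -> str:
    """Stand-in for a real download: returns a made-up local path."""
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


# str -> str
single = map_urls(fake_download, "https://abc.org/train.tar.gz")

# list -> list
as_list = map_urls(fake_download, ["https://a.org/1.jpg", "https://a.org/2.jpg"])

# dict -> dict, keys are preserved
as_dict = map_urls(fake_download, {
    "train": "https://abc.org/train.zip",
    "test": "https://abc.org/test.zip",
})
print(single, as_list, as_dict, sep="\n")
```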

For more control over the download/extraction (e.g. passwords, output_name, ...), pass a tfds.download.Resource as the argument.

Args
`download_dir`	Path to directory where downloads are stored.
`extract_dir`	Path to directory where artifacts are extracted.
`manual_dir`	Path to manually downloaded/extracted data directory.
`manual_dir_instructions`	Human-readable instructions on how to prepare the contents of manual_dir for this dataset.
`url_infos`	URL infos for the checksums.
`dataset_name`	Name of dataset this instance will be used for. If provided, downloads will contain which datasets they were used for.
`force_download`	If True, always [re]download.
`force_extraction`	If True, always [re]extract.
`force_checksums_validation`	If True, raises an error if a URL does not have checksums.
`register_checksums`	If True, download checksums aren't checked, but are stored in a file.
`register_checksums_path`	Path where checksums are saved. Should be set if register_checksums is True.
`verify_ssl`	`bool`, defaults to True. If True, will verify certificate when downloading dataset.
`max_simultaneous_downloads`	`int`, optional max number of simultaneous downloads.

Raises
`FileNotFoundError`	Raised if the register_checksums_path does not exist.

Attributes
`download_dir`
`downloaded_size`	Returns the total size of downloaded files.
`manual_dir`	Returns the directory containing the manually extracted data.
`register_checksums`	Returns whether checksums are being computed and recorded to file.

Methods

`download`

download(
    url_or_urls
)

Download given url(s).

Args
`url_or_urls`	url or `list`/`dict` of urls to download. Each url can be a `str` or `tfds.download.Resource`.

Returns
`downloaded_path(s)`	`str`, the downloaded paths matching the given input url_or_urls.

`download_and_extract`

download_and_extract(
    url_or_urls
)

Download and extract given url_or_urls.

Is roughly equivalent to:

extracted_paths = dl_manager.extract(dl_manager.download(url_or_urls))

Args
`url_or_urls`	url or `list`/`dict` of urls to download and extract. Each url can be a `str` or `tfds.download.Resource`. If not explicitly specified in `Resource`, the extraction method is deduced automatically from the downloaded file name.

Returns
`extracted_path(s)`	`str`, extracted paths of given URL(s).

`download_checksums`

download_checksums(
    checksums_url
)

Downloads checksum file from the given URL and adds it to registry.

`download_kaggle_data`

download_kaggle_data(
    competition_or_dataset: str
) -> epath.Path

Download data for a given Kaggle Dataset or competition.

Args
`competition_or_dataset`	Dataset name (`zillow/zecon`) or competition name (`titanic`)

Returns
The path to the downloaded files.

`extract`

extract(
    path_or_paths
)

Extract given path(s).

Args
`path_or_paths`	path or `list`/`dict` of paths of files to extract. Each path can be a `str` or `tfds.download.Resource`. If not explicitly specified in `Resource`, the extraction method is deduced from the downloaded file name.

Returns
`extracted_path(s)`	`str`, the extracted paths matching the given input path_or_paths.

`iter_archive`

iter_archive(
    resource: ExtractPath
) -> Iterator[Tuple[str, typing.BinaryIO]]

Returns iterator over files within archive.

Important note: the caller should read each file as it is yielded. Reading out of order is slow.

Args
`resource`	path to archive or `tfds.download.Resource`.

Returns
Generator yielding tuple (path_within_archive, file_obj).
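To illustrate the read-as-you-go contract, here is a minimal sketch of the same (path_within_archive, file_obj) iteration pattern using only the standard library `tarfile` module. It is an analogy, not the TFDS implementation: `iter_tar` and `build_demo_archive` are hypothetical helpers, and the archive is built in memory so the example is self-contained.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar(archive: BinaryIO) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (path_within_archive, file_obj) for each regular file, in stream order."""
    # Mode "r|*" reads the archive as a forward-only stream, so each
    # file object must be consumed before advancing to the next member.
    with tarfile.open(fileobj=archive, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def build_demo_archive() -> io.BytesIO:
    """Create a small in-memory tar archive with two files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


contents = {}
for path, file_obj in iter_tar(build_demo_archive()):
    # Read immediately, while the stream is positioned on this member.
    contents[path] = file_obj.read()
print(contents)
```

With `dl_manager.iter_archive(...)`, the loop body would take the same shape: read `file_obj` inside the loop rather than collecting the file objects and reading them later.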
