TFDS CLI

TFDS CLI is a command-line tool that provides various commands to easily work with TensorFlow Datasets.

Disable TF logs on import
%%capture
%env TF_CPP_MIN_LOG_LEVEL=1  # Disable logs on TF import

Installation

The CLI tool is installed with tensorflow-datasets (or tfds-nightly).

pip install -q tfds-nightly
tfds --version

For the list of all CLI commands:

tfds --help
usage: tfds [-h] [--helpfull] [--version] {build,new} ...

Tensorflow Datasets CLI tool

optional arguments:
  -h, --help   show this help message and exit
  --helpfull   show full help message and exit
  --version    show program's version number and exit

command:
  {build,new}
    build      Commands for downloading and preparing datasets.
    new        Creates a new dataset directory from the template.

tfds new: Implementing a new Dataset

This command will help you kickstart writing your new Python dataset by creating a <dataset_name>/ directory containing default implementation files.

Usage:

tfds new my_dataset

tfds new my_dataset will create:

ls -1 my_dataset/
CITATIONS.bib
README.md
my_dataset.py
my_dataset_test.py
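
The generated `my_dataset.py` follows the `tfds.core.GeneratorBasedBuilder` template. As a rough sketch (the exact generated contents vary across TFDS versions), the builder class looks like:

"""my_dataset dataset."""

import tensorflow_datasets as tfds


class Builder(tfds.core.GeneratorBasedBuilder):
  """DatasetBuilder for my_dataset."""

  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  def _info(self) -> tfds.core.DatasetInfo:
    """Returns the dataset metadata (features, supervised keys, ...)."""
    ...

  def _split_generators(self, dl_manager: tfds.download.DownloadManager):
    """Downloads the source data and defines the splits."""
    ...

  def _generate_examples(self, path):
    """Yields (key, example) tuples from the source data."""
    ...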

An optional flag --data_format can be used to generate format-specific dataset builders (e.g., conll). If no data format is given, it will generate a template for a standard [`tfds.core.GeneratorBasedBuilder`](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/GeneratorBasedBuilder). Refer to the documentation for details on the available format-specific dataset builders.
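
For example, to generate a template based on the CoNLL-specific builder:

tfds new my_dataset --data_format conll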

See our writing dataset guide for more info.

Available options:

tfds new --help
usage: tfds new [-h] [--helpfull] [--data_format {standard,conll,conllu}]
                [--dir DIR]
                dataset_name

positional arguments:
  dataset_name          Name of the dataset to be created (in snake_case)

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --data_format {standard,conll,conllu}
                        Optional format of the input data, which is used to
                        generate a format-specific template.
  --dir DIR             Path where the dataset directory will be created.
                        Defaults to current directory.

tfds build: Download and prepare a dataset

Use tfds build <my_dataset> to generate a new dataset. <my_dataset> can be:

  • A path to a dataset/ folder or dataset.py file (empty for the current directory):

    • tfds build datasets/my_dataset/
    • cd datasets/my_dataset/ && tfds build
    • cd datasets/my_dataset/ && tfds build my_dataset
    • cd datasets/my_dataset/ && tfds build my_dataset.py
  • A registered dataset:

    • tfds build mnist
    • tfds build my_dataset --imports my_project.datasets
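
Once built, the dataset can be loaded by name with `tfds.load`. A minimal sketch, assuming the default `~/tensorflow_datasets/` data dir (`my_project.datasets` stands in for your own module, which must be imported so the custom dataset registers with TFDS):

import tensorflow_datasets as tfds

# For a custom dataset, import the module defining it first so the builder
# registers with TFDS (not needed for datasets bundled with the library):
# import my_project.datasets

# Load the split prepared by `tfds build`; `data_dir` defaults to
# `~/tensorflow_datasets/`, matching the CLI default.
ds = tfds.load('my_dataset', split='train')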

Available options:

tfds build --help
usage: tfds build [-h] [--helpfull]
                  [--datasets DATASETS_KEYWORD [DATASETS_KEYWORD ...]]
                  [--overwrite] [--fail_if_exists]
                  [--max_examples_per_split [MAX_EXAMPLES_PER_SPLIT]]
                  [--data_dir DATA_DIR] [--download_dir DOWNLOAD_DIR]
                  [--extract_dir EXTRACT_DIR] [--manual_dir MANUAL_DIR]
                  [--add_name_to_manual_dir] [--download_only]
                  [--config CONFIG] [--config_idx CONFIG_IDX]
                  [--imports IMPORTS] [--register_checksums]
                  [--force_checksums_validation]
                  [--beam_pipeline_options BEAM_PIPELINE_OPTIONS]
                  [--file_format FILE_FORMAT] [--publish_dir PUBLISH_DIR]
                  [--skip_if_published] [--exclude_datasets EXCLUDE_DATASETS]
                  [--experimental_latest_version]
                  [datasets ...]

positional arguments:
  datasets              Name(s) of the dataset(s) to build. Defaults to the
                        current dir. See https://www.tensorflow.org/datasets/cli
                        for accepted values.

optional arguments:
  -h, --help            show this help message and exit
  --helpfull            show full help message and exit
  --datasets DATASETS_KEYWORD [DATASETS_KEYWORD ...]
                        Datasets can also be provided as a keyword argument.

Debug & tests:
  --pdb                 Enter post-mortem debugging mode if an exception is
                        raised.
  --overwrite           Delete pre-existing dataset if it exists.
  --fail_if_exists      Fails the program if there is a pre-existing dataset.
  --max_examples_per_split [MAX_EXAMPLES_PER_SPLIT]
                        When set, only generate the first X examples (default
                        to 1), rather than the full dataset. If set to 0, only
                        execute the `_split_generators` (which downloads the
                        original data), but skip `_generate_examples`.

Paths:
  --data_dir DATA_DIR   Where to place datasets. Defaults to
                        `~/tensorflow_datasets/` or the `TFDS_DATA_DIR`
                        environment variable.
  --download_dir DOWNLOAD_DIR
                        Where to place downloads. Defaults to
                        `<data_dir>/downloads/`.
  --extract_dir EXTRACT_DIR
                        Where to extract files. Defaults to
                        `<download_dir>/extracted/`.
  --manual_dir MANUAL_DIR
                        Where to manually download data (required for some
                        datasets). Defaults to `<download_dir>/manual/`.
  --add_name_to_manual_dir
                        If true, append the dataset name to the `manual_dir`
                        (e.g. `<download_dir>/manual/<dataset_name>/`). Useful
                        to avoid collisions if many datasets are generated.

Generation:
  --download_only       If True, download all files but do not prepare the
                        dataset. Uses the `checksums.tsv` file to find out
                        what to download. Therefore, this does not work in
                        combination with --register_checksums.
  --config CONFIG, -c CONFIG
                        Config name to build. Build all configs if not set.
                        Can also be a JSON string of the kwargs forwarded to
                        the config `__init__` (for custom configs).
  --config_idx CONFIG_IDX
                        Config id to build
                        (`builder_cls.BUILDER_CONFIGS[config_idx]`). Mutually
                        exclusive with `--config`.
  --imports IMPORTS, -i IMPORTS
                        Comma-separated list of modules to import to register
                        datasets.
  --register_checksums  If True, store size and checksum of downloaded files.
  --force_checksums_validation
                        If True, raise an error if the checksums are not
                        found.
  --beam_pipeline_options BEAM_PIPELINE_OPTIONS
                        A (comma-separated) list of flags to pass to
                        `PipelineOptions` when preparing with Apache Beam (see
                        https://www.tensorflow.org/datasets/beam_datasets).
                        Example:
                        `--beam_pipeline_options=job_name=my-job,project=my-project`
  --file_format FILE_FORMAT
                        File format in which to generate the tf-examples.
                        Available values: ['tfrecord', 'riegeli'] (see
                        `tfds.core.FileFormat`).

Publishing:
  Options for publishing successfully created datasets.

  --publish_dir PUBLISH_DIR
                        Where to optionally publish the dataset after it has
                        been generated successfully. Should be the root data
                        dir under which datasets are stored. If unspecified,
                        the dataset will not be published.
  --skip_if_published   If the dataset with the same version and config is
                        already published, then it will not be regenerated.

Automation:
  Used by automated scripts.

  --exclude_datasets EXCLUDE_DATASETS
                        If set, generate all datasets except the ones defined
                        here. Comma-separated list of datasets to exclude.
  --experimental_latest_version
                        Build the latest Version(experiments=...) available
                        rather than the default version.