Normally when you use TensorFlow Datasets, the downloaded and prepared data
will be cached in a local directory (by default `~/tensorflow_datasets/`).
In some environments where local disk may be ephemeral (a temporary cloud server
or a Colab notebook), or where you need the data to be accessible by multiple
machines, it's useful to set `data_dir` to a cloud storage system, like a
Google Cloud Storage (GCS) bucket.
- First, create a GCS bucket and ensure you have read/write permissions on it.
- If you'll be running from GCP machines where your personal credentials may not be available, you may want to create a service account and give it permissions on your bucket.
- On a non-GCP machine, you'll have to export a service account key as JSON (follow the instructions to create a new key) and set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point at it.
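Below is a minimal sketch of setting that variable from Python; the key path is a placeholder for wherever you saved the JSON key, and you can equally export the variable in your shell before starting the process.

```python
import os

# Placeholder path: point this at the service account JSON key you downloaded.
# Equivalent shell setup: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```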
When you use `tfds`, you can then simply set `data_dir` to your GCS bucket:

```python
import tensorflow_datasets as tfds

ds_train, ds_test = tfds.load(name="mnist", split=["train", "test"], data_dir="gs://YOUR_BUCKET_NAME")
```
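If you prefer to control download and preparation explicitly, the builder API accepts the same `data_dir`; this is a sketch assuming the same placeholder bucket name:

```python
import tensorflow_datasets as tfds

# Prepare (or reuse) the dataset directly in the GCS bucket.
builder = tfds.builder("mnist", data_dir="gs://YOUR_BUCKET_NAME")
builder.download_and_prepare()  # writes the prepared files under gs://YOUR_BUCKET_NAME/mnist/...

ds_train = builder.as_dataset(split="train")
```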
- This approach works for datasets that only use `tf.io.gfile` for data access. This is true for most datasets, but not all.
- Remember that accessing GCS means talking to a remote server and streaming data from it, so you may incur network costs; one way to limit repeated reads is sketched below.
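If you iterate over the same split many times (for example, across training epochs), one mitigation is to cache the dataset after loading. This is a minimal sketch assuming the data fits in memory; pass a local filename to `cache()` to spill to disk instead:

```python
import tensorflow_datasets as tfds

ds_train = tfds.load(name="mnist", split="train", data_dir="gs://YOUR_BUCKET_NAME")
# Cache elements after the first epoch so later epochs don't re-read from GCS.
ds_train = ds_train.cache().shuffle(10_000).batch(32).prefetch(1)
```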