How to run TensorFlow on S3

This document describes how to run TensorFlow on S3 file system.

S3

We assume that you are familiar with reading data.

To use S3 with TensorFlow, change the file paths you use to read and write data to an S3 path. For example:

filenames = ["s3://bucketname/path/to/file1.tfrecord",
             "s3://bucketname/path/to/file2.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

When reading or writing data on S3 with your TensorFlow program, the behavior could be controlled by various environmental variables:

  • AWS_REGION: By default, regional endpoint is used for S3, with region controlled by AWS_REGION. If AWS_REGION is not specified, then us-east-1 is used.
  • S3_ENDPOINT: The endpoint could be overridden explicitly with S3_ENDPOINT specified.
  • S3_USE_HTTPS: HTTPS is used to access S3 by default, unless S3_USE_HTTPS=0.
  • S3_VERIFY_SSL: If HTTPS is used, SSL verification could be disabled with S3_VERIFY_SSL=0.

To read or write objects in a bucket that is no publicly accessible, AWS credentials must be provided through one of the following methods:

  • Set credentials in the AWS credentials profile file on the local system, located at: ~/.aws/credentials on Linux, macOS, or Unix, or C:\Users\USERNAME\.aws\credentials on Windows.
  • Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
  • If TensorFlow is deployed on an EC2 instance, specify an IAM role and then give the EC2 instance access to that role.