This document describes how to run TensorFlow on Hadoop. It currently covers only running on HDFS; it will be expanded to describe running on various cluster managers.
We assume that you are familiar with reading data in TensorFlow.
To use HDFS with TensorFlow, change the file paths you use to read and write data to an HDFS path. For example:
filename_queue = tf.train.string_input_producer([
    "hdfs://namenode:8020/path/to/file1.csv",
    "hdfs://namenode:8020/path/to/file2.csv",
])
If you want to use the namenode specified in your HDFS configuration files, then change the file prefix to hdfs://default/.
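As a minimal sketch of the two path styles, the hypothetical helper below (`hdfs_path` is not part of TensorFlow) builds either a URI that names the namenode explicitly or one that uses the `default` authority, which is assumed here to select the namenode from the HDFS configuration files. The host name `namenode` and port 8020 are the illustrative values from the example above.

```python
def hdfs_path(path, namenode=None, port=8020):
    """Build an HDFS URI for `path` (hypothetical helper, not a TensorFlow API).

    With `namenode` set, the URI names that host and port explicitly.
    Without it, the URI uses the "default" authority, assumed here to
    defer to the namenode in the HDFS configuration files.
    """
    authority = "default" if namenode is None else "{}:{}".format(namenode, port)
    return "hdfs://{}/{}".format(authority, path.lstrip("/"))


filenames = [
    hdfs_path("/path/to/file1.csv", namenode="namenode"),  # explicit namenode
    hdfs_path("/path/to/file2.csv"),                       # configured namenode
]
print(filenames)
```

The resulting list can be passed wherever a list of file names is expected, such as the input producer call shown earlier.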
When launching your TensorFlow program, the following environment variables must be set:
- JAVA_HOME: The location of your Java installation.
- HADOOP_HDFS_HOME: The location of your HDFS installation. You can also set this environment variable by running:
  source ${HADOOP_HOME}/libexec/hadoop-config.sh
- LD_LIBRARY_PATH: Must include the directory that contains libjvm.so. On Linux:
  export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server
- CLASSPATH: The Hadoop jars must be added prior to running your TensorFlow program. The CLASSPATH set by $HADOOP_HOME/libexec/hadoop-config.sh is insufficient. Globs must be expanded as described in the libhdfs documentation:
CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob) python your_script.py
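The same launch can be scripted from Python. The sketch below is one possible wrapper, not an official tool: `hadoop_classpath`, `launch_env`, and `run_with_hadoop` are hypothetical helper names, and it assumes the `hadoop` executable lives under `$HADOOP_HDFS_HOME/bin`, as in the command above.

```python
import os
import subprocess
import sys


def hadoop_classpath(hdfs_home):
    """Return the glob-expanded classpath printed by `hadoop classpath --glob`."""
    hadoop_bin = os.path.join(hdfs_home, "bin", "hadoop")
    output = subprocess.check_output([hadoop_bin, "classpath", "--glob"])
    return output.decode().strip()


def launch_env(base_env, classpath):
    """Copy `base_env` and set CLASSPATH, leaving the original mapping untouched."""
    env = dict(base_env)
    env["CLASSPATH"] = classpath
    return env


def run_with_hadoop(script, base_env=None):
    """Run `script` in a child Python process with the expanded Hadoop CLASSPATH."""
    base_env = dict(os.environ) if base_env is None else base_env
    classpath = hadoop_classpath(base_env["HADOOP_HDFS_HOME"])
    return subprocess.call([sys.executable, script],
                           env=launch_env(base_env, classpath))
```

Calling `run_with_hadoop("your_script.py")` would then behave much like the one-line shell command, provided `HADOOP_HDFS_HOME` is already set in the environment.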
If you are running Distributed TensorFlow, then all workers must have the environment variables set and Hadoop installed.
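Because every worker needs the same setup, a check at startup can catch a misconfigured machine early. The following is a minimal sketch: `missing_hadoop_env` is a hypothetical helper, the sample values are illustrative, and it only verifies that the four variables listed above are non-empty, not that their values are valid.

```python
import os

# The four variables this document requires before launching TensorFlow.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HDFS_HOME", "LD_LIBRARY_PATH", "CLASSPATH")


def missing_hadoop_env(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Illustrative environment: JAVA_HOME is set, CLASSPATH is present but empty.
example = {"JAVA_HOME": "/usr/lib/jvm/java", "CLASSPATH": ""}
print(missing_hadoop_env(example))
# → ['HADOOP_HDFS_HOME', 'LD_LIBRARY_PATH', 'CLASSPATH']
```

A worker could call `missing_hadoop_env()` with no arguments before building its graph and exit with an error message if the returned list is not empty.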