
Build a linear model with Estimators


This tutorial uses the tf.estimator API in TensorFlow to solve a benchmark binary classification problem. Estimators are TensorFlow's most scalable and production-oriented model type. For more information, see the Estimator guide.

Overview

Using census data containing a person's age, education, marital status, and occupation (the features), we will try to predict whether the person earns more than 50,000 dollars a year (the target label). We will train a logistic regression model that, given an individual's information, outputs a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.
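Concretely, the model computes a weighted sum of the input features and squashes it through the sigmoid function, so the output always lies between 0 and 1. In standard logistic-regression notation:

$$P(\text{income} > 50\text{K} \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$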

Setup

Import TensorFlow, feature column support, and supporting modules:

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import tensorflow.feature_column as fc

import os
import sys

import matplotlib.pyplot as plt
from IPython.display import clear_output

And let's enable eager execution to inspect this program as we run it:

tf.enable_eager_execution()

Download the official implementation

We'll use the wide and deep model available in TensorFlow's model repository. First, download the code:

! pip install -q requests
! git clone --depth 1 https://github.com/tensorflow/models
Cloning into 'models'...
remote: Enumerating objects: 3173, done.
remote: Counting objects: 100% (3173/3173), done.
remote: Compressing objects: 100% (2688/2688), done.
remote: Total 3173 (delta 574), reused 2071 (delta 407), pack-reused 0
Receiving objects: 100% (3173/3173), 370.58 MiB | 16.52 MiB/s, done.
Resolving deltas: 100% (574/574), done.
Checking connectivity... done.

Add the root directory of the repository to your Python path:

models_path = os.path.join(os.getcwd(), 'models')

sys.path.append(models_path)

Download the dataset:

from official.wide_deep import census_dataset
from official.wide_deep import census_main

census_dataset.download("/tmp/census_data/")
WARNING: Logging before flag parsing goes to stderr.
W0625 16:04:36.412110 139807458662144 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:78: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

W0625 16:04:36.413802 139807458662144 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:81: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0625 16:04:38.253764 139807458662144 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:62: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.

W0625 16:04:38.488776 139807458662144 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:73: The name tf.gfile.Remove is deprecated. Please use tf.io.gfile.remove instead.

Command line usage

The repo includes a complete program for experimenting with this type of model.

To execute the tutorial code from the command line, first add the path to tensorflow/models to your PYTHONPATH.

#export PYTHONPATH=${PYTHONPATH}:"$(pwd)/models"
# When running from Python, set `os.environ` directly or the subprocess will not see the directory:

if "PYTHONPATH" in os.environ:
  os.environ['PYTHONPATH'] += os.pathsep +  models_path
else:
  os.environ['PYTHONPATH'] = models_path

Use --help to see what command line options are available:

!python -m official.wide_deep.census_main --help
WARNING: Logging before flag parsing goes to stderr.
W0625 16:04:41.850976 140369815934720 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_main.py:114: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0625 16:04:41.851271 140369815934720 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_main.py:114: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

Train DNN on census income dataset.
flags:

/tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_main.py:
  -bs,--batch_size:
    Batch size for training and evaluation. When using multiple gpus, this is
    the
    global batch size for all devices. For example, if the batch size is 32 and
    there are 4 GPUs, each GPU will get 8 examples on each step.
    (default: '40')
    (an integer)
  --[no]clean:
    If set, model_dir will be removed if it exists.
    (default: 'false')
  -dd,--data_dir:
    The location of the input data.
    (default: '/tmp/census_data')
  --[no]download_if_missing:
    Download data to data_dir if it is not already present.
    (default: 'true')
  -ebe,--epochs_between_evals:
    The number of training epochs to run between evaluations.
    (default: '2')
    (an integer)
  -ed,--export_dir:
    If set, a SavedModel serialization of the model will be exported to this
    directory at the end of training. See the README for more details and
    relevant
    links.
  -hk,--hooks:
    A list of (case insensitive) strings to specify the names of training hooks.
      Hook:
        profilerhook
        loggingtensorhook
        loggingmetrichook
        examplespersecondhook
      Example: `--hooks ProfilerHook,ExamplesPerSecondHook`
    See official.utils.logs.hooks_helper for details.
    (default: 'LoggingTensorHook')
    (a comma separated list)
  -md,--model_dir:
    The location of the model checkpoint files.
    (default: '/tmp/census_model')
  -mt,--model_type: <wide|deep|wide_deep>: Select model topology.
    (default: 'wide_deep')
  -te,--train_epochs:
    The number of epochs used to train.
    (default: '40')
    (an integer)

Try --helpfull to get a list of all flags.

Now run the model:

!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2
WARNING: Logging before flag parsing goes to stderr.
W0625 16:04:44.035587 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_main.py:114: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

W0625 16:04:44.035927 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_main.py:114: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

W0625 16:04:44.039005 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:78: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

W0625 16:04:44.039243 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:81: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0625 16:04:44.040593 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_main.py:49: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

I0625 16:04:44.041406 140008099481344 estimator.py:209] Using config: {'_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_task_id': 0, '_is_chief': True, '_save_checkpoints_secs': 600, '_num_worker_replicas': 1, '_session_config': device_count {
  key: "GPU"
  value: 0
}
, '_task_type': 'worker', '_master': '', '_experimental_max_worker_delay_secs': None, '_experimental_distribute': None, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_global_id_in_cluster': 0, '_model_dir': '/tmp/census_model', '_num_ps_replicas': 0, '_service': None, '_protocol': None, '_keep_checkpoint_max': 5, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f55c873f470>, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 100, '_device_fn': None}
W0625 16:04:44.042670 140008099481344 logger.py:391] 'cpuinfo' not imported. CPU info will not be logged.
2019-06-25 16:04:44.042943: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-06-25 16:04:44.070341: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-25 16:04:44.227732: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 16:04:44.228457: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4f53b00 executing computations on platform CUDA. Devices:
2019-06-25 16:04:44.228519: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2019-06-25 16:04:44.232446: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000175000 Hz
2019-06-25 16:04:44.233333: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4fc5c80 executing computations on platform Host. Devices:
2019-06-25 16:04:44.233368: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-06-25 16:04:44.233585: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 16:04:44.234034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:05.0
2019-06-25 16:04:44.234339: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-25 16:04:44.235853: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-25 16:04:44.237356: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-25 16:04:44.237728: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-25 16:04:44.239673: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-25 16:04:44.241186: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-25 16:04:44.245276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-25 16:04:44.245409: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 16:04:44.245857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 16:04:44.246289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-06-25 16:04:44.246345: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-25 16:04:44.247536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-25 16:04:44.247568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-06-25 16:04:44.247576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-06-25 16:04:44.247909: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 16:04:44.248383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-25 16:04:44.248848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 14850 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0)
I0625 16:04:44.280929 140008099481344 logger.py:152] Benchmark run: {'dataset': {'name': 'Census Income'}, 'run_parameters': [{'name': 'batch_size', 'long_value': 40}, {'string_value': 'wide', 'name': 'model_type'}, {'name': 'train_epochs', 'long_value': 2}], 'test_environment': 'GCP', 'model_name': 'wide_deep', 'machine_config': {'gpu_info': {'model': 'Tesla V100-SXM2-16GB', 'count': 1}, 'memory_total': 31616577536, 'memory_available': 30082547712}, 'run_date': '2019-06-25T16:04:44.042097Z', 'tensorflow_version': {'git_hash': 'v1.14.0-rc1-22-gaf24dc91b5', 'version': '1.14.0'}, 'test_id': None, 'tensorflow_environment_variables': []}
W0625 16:04:44.287105 140008099481344 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0625 16:04:44.318942 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:167: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

I0625 16:04:44.319118 140008099481344 census_dataset.py:167] Parsing /tmp/census_data/adult.data
W0625 16:04:44.319229 140008099481344 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:168: The name tf.decode_csv is deprecated. Please use tf.io.decode_csv instead.

I0625 16:04:44.359962 140008099481344 estimator.py:1145] Calling model_fn.
W0625 16:04:44.713873 140008099481344 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/ops/sparse_ops.py:1719: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0625 16:04:45.240099 140008099481344 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow_estimator/python/estimator/canned/linear.py:308: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
I0625 16:04:45.909570 140008099481344 estimator.py:1147] Done calling model_fn.
I0625 16:04:45.909945 140008099481344 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0625 16:04:46.424942 140008099481344 monitored_session.py:240] Graph was finalized.
2019-06-25 16:04:46.425759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-25 16:04:46.425801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
2019-06-25 16:04:46.523912: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0625 16:04:46.574896 140008099481344 session_manager.py:500] Running local_init_op.
I0625 16:04:46.603173 140008099481344 session_manager.py:502] Done running local_init_op.
I0625 16:04:47.417989 140008099481344 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/census_model/model.ckpt.
I0625 16:04:48.227975 140008099481344 basic_session_run_hooks.py:262] average_loss = 0.6931472, loss = 27.725887
I0625 16:04:48.228657 140008099481344 basic_session_run_hooks.py:262] loss = 27.725887, step = 1
I0625 16:04:49.001455 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 129.194
I0625 16:04:49.002458 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.25440973, loss = 10.176389 (0.775 sec)
I0625 16:04:49.002712 140008099481344 basic_session_run_hooks.py:260] loss = 10.176389, step = 101 (0.774 sec)
I0625 16:04:49.388379 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 258.459
I0625 16:04:49.389233 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.23366055, loss = 9.346422 (0.387 sec)
I0625 16:04:49.389536 140008099481344 basic_session_run_hooks.py:260] loss = 9.346422, step = 201 (0.387 sec)
I0625 16:04:49.773041 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 259.969
I0625 16:04:49.773958 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.3713664, loss = 14.854656 (0.385 sec)
I0625 16:04:49.774183 140008099481344 basic_session_run_hooks.py:260] loss = 14.854656, step = 301 (0.385 sec)
I0625 16:04:50.148080 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 266.637
I0625 16:04:50.149002 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.29316738, loss = 11.726695 (0.375 sec)
I0625 16:04:50.149297 140008099481344 basic_session_run_hooks.py:260] loss = 11.726695, step = 401 (0.375 sec)
I0625 16:04:50.513535 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 273.614
I0625 16:04:50.514322 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.25354862, loss = 10.141945 (0.365 sec)
I0625 16:04:50.514591 140008099481344 basic_session_run_hooks.py:260] loss = 10.141945, step = 501 (0.365 sec)
I0625 16:04:50.829429 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 316.542
I0625 16:04:50.830225 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.23268667, loss = 9.3074665 (0.316 sec)
I0625 16:04:50.830438 140008099481344 basic_session_run_hooks.py:260] loss = 9.3074665, step = 601 (0.316 sec)
I0625 16:04:51.121514 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 342.442
I0625 16:04:51.122241 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.22557859, loss = 9.023144 (0.292 sec)
I0625 16:04:51.122428 140008099481344 basic_session_run_hooks.py:260] loss = 9.023144, step = 701 (0.292 sec)
I0625 16:04:51.413793 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 342.071
I0625 16:04:51.414545 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.2859376, loss = 11.437505 (0.292 sec)
I0625 16:04:51.414744 140008099481344 basic_session_run_hooks.py:260] loss = 11.437505, step = 801 (0.292 sec)
I0625 16:04:51.759189 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 289.524
I0625 16:04:51.759977 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.22834139, loss = 9.133656 (0.345 sec)
I0625 16:04:51.760218 140008099481344 basic_session_run_hooks.py:260] loss = 9.133656, step = 901 (0.345 sec)
I0625 16:04:52.045602 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 349.123
I0625 16:04:52.046190 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.3480237, loss = 13.920948 (0.286 sec)
I0625 16:04:52.046396 140008099481344 basic_session_run_hooks.py:260] loss = 13.920948, step = 1001 (0.286 sec)
I0625 16:04:52.330088 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 351.545
I0625 16:04:52.330838 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.25689358, loss = 10.275743 (0.285 sec)
I0625 16:04:52.331063 140008099481344 basic_session_run_hooks.py:260] loss = 10.275743, step = 1101 (0.285 sec)
I0625 16:04:52.620900 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 343.852
I0625 16:04:52.621669 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.22275586, loss = 8.910234 (0.291 sec)
I0625 16:04:52.621881 140008099481344 basic_session_run_hooks.py:260] loss = 8.910234, step = 1201 (0.291 sec)
I0625 16:04:52.910346 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 345.488
I0625 16:04:52.911042 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.352034, loss = 14.08136 (0.289 sec)
I0625 16:04:52.911252 140008099481344 basic_session_run_hooks.py:260] loss = 14.08136, step = 1301 (0.289 sec)
I0625 16:04:53.194451 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 351.998
I0625 16:04:53.195210 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.48226565, loss = 19.290627 (0.284 sec)
I0625 16:04:53.195415 140008099481344 basic_session_run_hooks.py:260] loss = 19.290627, step = 1401 (0.284 sec)
I0625 16:04:53.491912 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 336.162
I0625 16:04:53.492692 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.4897923, loss = 19.591692 (0.297 sec)
I0625 16:04:53.492927 140008099481344 basic_session_run_hooks.py:260] loss = 19.591692, step = 1501 (0.298 sec)
I0625 16:04:53.787713 140008099481344 basic_session_run_hooks.py:692] global_step/sec: 338.066
I0625 16:04:53.788544 140008099481344 basic_session_run_hooks.py:260] average_loss = 0.53748876, loss = 21.49955 (0.296 sec)
I0625 16:04:53.788762 140008099481344 basic_session_run_hooks.py:260] loss = 21.49955, step = 1601 (0.296 sec)
I0625 16:04:53.875792 140008099481344 basic_session_run_hooks.py:606] Saving checkpoints for 1629 into /tmp/census_model/model.ckpt.
I0625 16:04:54.014202 140008099481344 estimator.py:368] Loss for final step: 1.0886656.
I0625 16:04:54.032937 140008099481344 census_dataset.py:167] Parsing /tmp/census_data/adult.test
I0625 16:04:54.071678 140008099481344 estimator.py:1145] Calling model_fn.
W0625 16:04:55.119668 140008099481344 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/ops/metrics_impl.py:2027: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
W0625 16:04:55.527082 140008099481344 metrics_impl.py:804] Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
W0625 16:04:55.549229 140008099481344 metrics_impl.py:804] Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
I0625 16:04:55.571621 140008099481344 estimator.py:1147] Done calling model_fn.
I0625 16:04:55.593022 140008099481344 evaluation.py:255] Starting evaluation at 2019-06-25T16:04:55Z
I0625 16:04:55.730614 140008099481344 monitored_session.py:240] Graph was finalized.
2019-06-25 16:04:55.731149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-25 16:04:55.731177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0625 16:04:55.731291 140008099481344 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I0625 16:04:55.732383 140008099481344 saver.py:1280] Restoring parameters from /tmp/census_model/model.ckpt-1629
I0625 16:04:55.844604 140008099481344 session_manager.py:500] Running local_init_op.
I0625 16:04:55.907182 140008099481344 session_manager.py:502] Done running local_init_op.
I0625 16:04:57.354998 140008099481344 evaluation.py:275] Finished evaluation at 2019-06-25-16:04:57
I0625 16:04:57.355313 140008099481344 estimator.py:2039] Saving dict for global step 1629: accuracy = 0.83397824, accuracy_baseline = 0.76377374, auc = 0.8841904, auc_precision_recall = 0.6952039, average_loss = 0.35092422, global_step = 1629, label/mean = 0.23622628, loss = 14.003426, precision = 0.67765, prediction/mean = 0.243432, recall = 0.56682265
I0625 16:04:57.585783 140008099481344 estimator.py:2099] Saving 'checkpoint_path' summary for global step 1629: /tmp/census_model/model.ckpt-1629
I0625 16:04:57.586608 140008099481344 wide_deep_run_loop.py:116] Results at epoch 2 / 2
I0625 16:04:57.586776 140008099481344 wide_deep_run_loop.py:117] ------------------------------------------------------------
I0625 16:04:57.586858 140008099481344 wide_deep_run_loop.py:120] accuracy: 0.83397824
I0625 16:04:57.586925 140008099481344 wide_deep_run_loop.py:120] accuracy_baseline: 0.76377374
I0625 16:04:57.586991 140008099481344 wide_deep_run_loop.py:120] auc: 0.8841904
I0625 16:04:57.587061 140008099481344 wide_deep_run_loop.py:120] auc_precision_recall: 0.6952039
I0625 16:04:57.587121 140008099481344 wide_deep_run_loop.py:120] average_loss: 0.35092422
I0625 16:04:57.587196 140008099481344 wide_deep_run_loop.py:120] global_step: 1629
I0625 16:04:57.587252 140008099481344 wide_deep_run_loop.py:120] label/mean: 0.23622628
I0625 16:04:57.587306 140008099481344 wide_deep_run_loop.py:120] loss: 14.003426
I0625 16:04:57.587360 140008099481344 wide_deep_run_loop.py:120] precision: 0.67765
I0625 16:04:57.587414 140008099481344 wide_deep_run_loop.py:120] prediction/mean: 0.243432
I0625 16:04:57.587486 140008099481344 wide_deep_run_loop.py:120] recall: 0.56682265
I0625 16:04:57.587665 140008099481344 logger.py:147] Benchmark metric: {'value': 0.8339782357215881, 'name': 'accuracy', 'timestamp': '2019-06-25T16:04:57.587601Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.587777 140008099481344 logger.py:147] Benchmark metric: {'value': 0.7637737393379211, 'name': 'accuracy_baseline', 'timestamp': '2019-06-25T16:04:57.587756Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.587867 140008099481344 logger.py:147] Benchmark metric: {'value': 0.8841903805732727, 'name': 'auc', 'timestamp': '2019-06-25T16:04:57.587848Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.587953 140008099481344 logger.py:147] Benchmark metric: {'value': 0.6952039003372192, 'name': 'auc_precision_recall', 'timestamp': '2019-06-25T16:04:57.587933Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.588047 140008099481344 logger.py:147] Benchmark metric: {'value': 0.35092422366142273, 'name': 'average_loss', 'timestamp': '2019-06-25T16:04:57.588029Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.588135 140008099481344 logger.py:147] Benchmark metric: {'value': 0.23622627556324005, 'name': 'label/mean', 'timestamp': '2019-06-25T16:04:57.588117Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.588216 140008099481344 logger.py:147] Benchmark metric: {'value': 14.003425598144531, 'name': 'loss', 'timestamp': '2019-06-25T16:04:57.588199Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.588296 140008099481344 logger.py:147] Benchmark metric: {'value': 0.677649974822998, 'name': 'precision', 'timestamp': '2019-06-25T16:04:57.588279Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.588376 140008099481344 logger.py:147] Benchmark metric: {'value': 0.24343200027942657, 'name': 'prediction/mean', 'timestamp': '2019-06-25T16:04:57.588359Z', 'global_step': 1629, 'extras': [], 'unit': None}
I0625 16:04:57.588462 140008099481344 logger.py:147] Benchmark metric: {'value': 0.5668226480484009, 'name': 'recall', 'timestamp': '2019-06-25T16:04:57.588439Z', 'global_step': 1629, 'extras': [], 'unit': None}

Read the U.S. Census data

This example uses the U.S. Census Income Dataset from 1994 and 1995. We have provided the census_dataset.py script to download the data and perform a little cleanup.

Since the task is a binary classification problem, we'll construct a label column named "label" whose value is 1 if the income is over 50K, and 0 otherwise. For reference, see the input_fn in census_main.py.
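As a sketch of what that label construction looks like (using a tiny hypothetical DataFrame; the tutorial's own pipeline does the equivalent with tf.equal inside input_fn):

import pandas

# Hypothetical example: map the raw income strings to a 0/1 label.
df = pandas.DataFrame({'income_bracket': ['<=50K', '>50K', '<=50K']})
df['label'] = (df['income_bracket'] == '>50K').astype(int)
print(df['label'].values)  # [0 1 0]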

Let's look at the data to see which columns we can use to predict the target label:

!ls  /tmp/census_data/
adult.data  adult.test
train_file = "/tmp/census_data/adult.data"
test_file = "/tmp/census_data/adult.test"

pandas provides some convenient utilities for data analysis. Here's a list of columns available in the Census Income dataset:

import pandas

train_df = pandas.read_csv(train_file, header=None, names=census_dataset._CSV_COLUMNS)
test_df = pandas.read_csv(test_file, header=None, names=census_dataset._CSV_COLUMNS)

train_df.head()

The columns are grouped into two types: categorical and continuous columns:

  • A column is called categorical if its value can only be one of the categories in a finite set. For example, the relationship status of a person (wife, husband, unmarried, etc.) or the education level (high school, college, etc.) are categorical columns.
  • A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.

Converting Data into Tensors

When building a tf.estimator model, the input data is specified by using an input function (or input_fn). This builder function returns a tf.data.Dataset of batches of (features-dict, label) pairs. It is not called until it is passed to tf.estimator.Estimator methods such as train and evaluate.

The input builder function returns the following pair:

  1. features: A dict from feature names to Tensors or SparseTensors containing batches of features.
  2. labels: A Tensor containing batches of labels.

The keys of the features are used to configure the model's input layer.

For small problems like this, it's easy to make a tf.data.Dataset by slicing the pandas.DataFrame:

def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
  label = df[label_key]
  ds = tf.data.Dataset.from_tensor_slices((dict(df),label))

  if shuffle:
    ds = ds.shuffle(10000)

  ds = ds.batch(batch_size).repeat(num_epochs)

  return ds

Since we have eager execution enabled, it's easy to inspect the resulting dataset:

ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys())[:5])
  print()
  print('A batch of Ages  :', feature_batch['age'])
  print()
  print('A batch of Labels:', label_batch )
Some feature keys: ['age', 'education', 'fnlwgt', 'hours_per_week', 'workclass']

A batch of Ages  : tf.Tensor([35 20 46 29 42 40 46 33 36 62], shape=(10,), dtype=int32)

A batch of Labels: tf.Tensor(
[b'<=50K' b'<=50K' b'<=50K' b'<=50K' b'>50K' b'<=50K' b'>50K' b'<=50K'
 b'<=50K' b'<=50K'], shape=(10,), dtype=string)

But this approach has severely limited scalability; larger datasets should be streamed from disk. The census_dataset.input_fn provides an example of how to do this using tf.decode_csv and tf.data.TextLineDataset:

import inspect
print(inspect.getsource(census_dataset.input_fn))
def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have run census_dataset.py and '
      'set the --data_dir argument to the correct path.' % data_file)

  def parse_csv(value):
    tf.logging.info('Parsing {}'.format(data_file))
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    classes = tf.equal(labels, '>50K')  # binary classification
    return features, classes

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv, num_parallel_calls=5)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)
  return dataset

This input_fn returns equivalent output:

ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Feature keys:', list(feature_batch.keys())[:5])
  print()
  print('Age batch   :', feature_batch['age'])
  print()
  print('Label batch :', label_batch )
W0625 16:04:58.875633 139807458662144 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:167: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W0625 16:04:58.876674 139807458662144 deprecation_wrapper.py:119] From /tmpfs/src/temp/site/en/tutorials/estimators/models/official/wide_deep/census_dataset.py:168: The name tf.decode_csv is deprecated. Please use tf.io.decode_csv instead.


Feature keys: ['age', 'education', 'fnlwgt', 'hours_per_week', 'workclass']

Age batch   : tf.Tensor([35 48 43 64 41 55 31 34 46 44], shape=(10,), dtype=int32)

Label batch : tf.Tensor([False False  True False  True  True False  True  True  True], shape=(10,), dtype=bool)

Because Estimators expect an input_fn that takes no arguments, we typically wrap a configurable input function into an object with the expected signature. For this notebook, configure train_inpf to iterate over the data twice:

import functools

train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)

Selecting and Engineering Features for the Model

Estimators use a system called feature columns to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature.

Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw inputs in the original features dict (a base feature column), or a new column created using transformations defined over one or multiple base columns (a derived feature column).

A feature column is an abstract concept of any raw or derived variable that can be used to predict the target label.

Base Feature Columns

Numeric columns

The simplest feature_column is numeric_column. This indicates that a feature is a numeric value that should be input to the model directly. For example:

age = fc.numeric_column('age')

The model will use the feature_column definitions to build the model input. You can inspect the resulting output using the input_layer function:

fc.input_layer(feature_batch, [age]).numpy()
W0625 16:04:59.032072 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:205: NumericColumn._get_dense_tensor (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:04:59.033224 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:2115: NumericColumn._transform_feature (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:04:59.847502 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:206: NumericColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

array([[35.],
       [48.],
       [43.],
       [64.],
       [41.],
       [55.],
       [31.],
       [34.],
       [46.],
       [44.]], dtype=float32)

The following will train and evaluate a model using only the age feature:

classifier = tf.estimator.LinearClassifier(feature_columns=[age])
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()  # used for display in notebook
print(result)
{'precision': 0.1780822, 'label/mean': 0.23622628, 'accuracy_baseline': 0.76377374, 'prediction/mean': 0.23925366, 'auc': 0.6783024, 'loss': 33.404175, 'recall': 0.0033801352, 'average_loss': 0.5231905, 'auc_precision_recall': 0.31137863, 'accuracy': 0.7608869, 'global_step': 1018}

Similarly, we can define a NumericColumn for each continuous feature column that we want to use in the model:

education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

my_numeric_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]

fc.input_layer(feature_batch, my_numeric_columns).numpy()
array([[3.500e+01, 0.000e+00, 0.000e+00, 1.100e+01, 4.000e+01],
       [4.800e+01, 0.000e+00, 0.000e+00, 5.000e+00, 4.000e+01],
       [4.300e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],
       [6.400e+01, 0.000e+00, 0.000e+00, 8.000e+00, 4.000e+01],
       [4.100e+01, 0.000e+00, 0.000e+00, 9.000e+00, 4.000e+01],
       [5.500e+01, 0.000e+00, 0.000e+00, 1.300e+01, 6.000e+01],
       [3.100e+01, 0.000e+00, 0.000e+00, 1.300e+01, 3.500e+01],
       [3.400e+01, 0.000e+00, 1.848e+03, 1.300e+01, 4.000e+01],
       [4.600e+01, 7.298e+03, 0.000e+00, 9.000e+00, 4.000e+01],
       [4.400e+01, 4.386e+03, 0.000e+00, 1.400e+01, 4.000e+01]],
      dtype=float32)

You could retrain a model on these features by changing the feature_columns argument to the constructor:

classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns)
classifier.train(train_inpf)

result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))
accuracy: 0.78391993
accuracy_baseline: 0.76377374
auc: 0.68254405
auc_precision_recall: 0.5000775
average_loss: 0.991552
global_step: 1018
label/mean: 0.23622628
loss: 63.30768
precision: 0.6261538
prediction/mean: 0.21659191
recall: 0.21164846

Categorical columns

To define a feature column for a categorical feature, create a CategoricalColumn using one of the tf.feature_column.categorical_column* functions.

If you know the set of all possible feature values of a column—and there are only a few of them—use categorical_column_with_vocabulary_list. Each key in the list is assigned an auto-incremented ID starting from 0. For example, for the relationship column we can assign the feature string 'Husband' the integer ID 0, 'Not-in-family' the ID 1, and so on.

relationship = fc.categorical_column_with_vocabulary_list(
    'relationship',
    ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative'])

This creates a sparse one-hot vector from the raw input feature.

The input_layer function we're using is designed for DNN models and expects dense inputs. To demonstrate the categorical column, we must wrap it in a tf.feature_column.indicator_column to create the dense one-hot output (linear Estimators can often skip this dense step).

Run the input layer, configured with both the age and relationship columns:

fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])
W0625 16:05:17.995948 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:205: IndicatorColumn._get_dense_tensor (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:17.996864 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:2115: IndicatorColumn._transform_feature (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:17.997511 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4236: VocabularyListCategoricalColumn._get_sparse_tensors (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:17.998070 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:2115: VocabularyListCategoricalColumn._transform_feature (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:18.002923 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4207: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:18.003642 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4262: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

<tf.Tensor: id=5100, shape=(10, 7), dtype=float32, numpy=
array([[35.,  0.,  0.,  0.,  1.,  0.,  0.],
       [48.,  1.,  0.,  0.,  0.,  0.,  0.],
       [43.,  1.,  0.,  0.,  0.,  0.,  0.],
       [64.,  1.,  0.,  0.,  0.,  0.,  0.],
       [41.,  1.,  0.,  0.,  0.,  0.,  0.],
       [55.,  1.,  0.,  0.,  0.,  0.,  0.],
       [31.,  0.,  0.,  1.,  0.,  0.,  0.],
       [34.,  1.,  0.,  0.,  0.,  0.,  0.],
       [46.,  1.,  0.,  0.,  0.,  0.,  0.],
       [44.,  0.,  0.,  1.,  0.,  0.,  0.]], dtype=float32)>

If we don't know the set of possible values in advance, use categorical_column_with_hash_bucket instead:

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

Here, each possible value in the occupation feature column is hashed to an integer ID as it is encountered in training. The example batch has a few different occupations:

for item in feature_batch['occupation'].numpy():
    print(item.decode())
Prof-specialty
Machine-op-inspct
Sales
Craft-repair
Craft-repair
Sales
Prof-specialty
Prof-specialty
Other-service
Prof-specialty

If we run input_layer with the hashed column, we see that the output shape is (batch_size, hash_bucket_size):

occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])

occupation_result.numpy().shape
W0625 16:05:18.036207 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4236: HashedCategoricalColumn._get_sparse_tensors (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:18.037118 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:2115: HashedCategoricalColumn._transform_feature (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:18.039789 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4262: HashedCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

(10, 1000)

It's easier to see the actual results if we take the tf.argmax over the hash_bucket_size dimension. Notice how any duplicate occupations are mapped to the same pseudo-random index:

tf.argmax(occupation_result, axis=1).numpy()
array([979, 911, 631, 466, 466, 631, 979, 979, 527, 979])

No matter how we choose to define a SparseColumn, each feature string is mapped to an integer ID by looking up a fixed mapping or by hashing. Under the hood, the LinearModel class is responsible for managing the mapping and creating tf.Variable objects to store the model parameters (model weights) for each feature ID. The model parameters are learned through the model training process described later.
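The hashing step itself is easy to reproduce. Below is a minimal sketch of the idea using TensorFlow's string-hashing op; the column's actual internals may take a different code path, so treat the exact bucket IDs as illustrative:

# Hash each occupation string into one of 1000 buckets, the same kind of
# mapping a hashed categorical column performs on its input strings.
occupations = tf.constant(['Prof-specialty', 'Sales', 'Prof-specialty'])
print(tf.strings.to_hash_bucket_fast(occupations, num_buckets=1000).numpy())
# Identical strings always land in the same bucket.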

Let's use the same approach to define the other categorical features:

education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])


my_categorical_columns = [relationship, occupation, education, marital_status, workclass]

It's easy to use both sets of columns to configure a model that uses all these features:

classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))
accuracy: 0.8335483
accuracy_baseline: 0.76377374
auc: 0.8860394
auc_precision_recall: 0.70089626
average_loss: 0.4497595
global_step: 1018
label/mean: 0.23622628
loss: 28.715822
precision: 0.65130526
prediction/mean: 0.25957206
recall: 0.63572544

Derived feature columns

Make Continuous Features Categorical through Bucketization

Sometimes the relationship between a continuous feature and the label is not linear. Consider age and income: a person's income may grow in the early stage of their career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw age as a real-valued feature column might not be a good choice because the model can only learn one of three cases:

  1. Income always increases at some rate as age grows (positive correlation),
  2. Income always decreases at some rate as age grows (negative correlation), or
  3. Income stays the same regardless of age (no correlation).

If we want to learn the fine-grained correlation between income and each age group separately, we can leverage bucketization. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. So, we can define a bucketized_column over age as:

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

boundaries is a list of bucket boundaries. In this case, there are 10 boundaries, resulting in 11 age group buckets (from age 17 and below, 18-24, 25-29, ..., to 65 and over).
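The bucket assignment follows the usual half-open-interval convention: values below the first boundary fall in bucket 0, and each boundary starts a new bucket. A quick sketch with numpy (not the column's internal code, but it follows the same convention):

import numpy as np

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ages = np.array([17, 18, 24, 42, 65, 80])
# np.digitize returns the bucket index each age falls into (0..len(boundaries)).
print(np.digitize(ages, boundaries))  # [ 0  1  1  5 10 10]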

With bucketing, the model sees each bucket as a one-hot feature:

fc.input_layer(feature_batch, [age, age_buckets]).numpy()
W0625 16:05:36.579093 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:205: BucketizedColumn._get_dense_tensor (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:36.579900 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:2115: BucketizedColumn._transform_feature (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0625 16:05:36.581717 139807458662144 deprecation.py:323] From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py:206: BucketizedColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

array([[35.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [48.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [43.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [64.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [41.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [55.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [31.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [34.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [46.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [44.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.]],
      dtype=float32)

Learn complex relationships with crossed columns

Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning more than 50,000 dollars) may differ across occupations. Therefore, if we only learn a single model weight for education="Bachelors" and education="Masters", we won't capture every education-occupation combination (e.g. we can't distinguish between education="Bachelors" AND occupation="Exec-managerial" and education="Bachelors" AND occupation="Craft-repair").

To learn the differences between different feature combinations, we can add crossed feature columns to the model:

education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

We can also create a crossed_column over more than two columns. Each constituent column can be either a base feature column that is categorical (SparseColumn), a bucketized real-valued feature column, or even another CrossedColumn. For example:

age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)

These crossed columns always use hash buckets to avoid the exponential explosion in the number of categories, and to put control over the number of model weights in the hands of the user.
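Conceptually, a crossed column joins the constituent values into a single key and hashes that key into hash_bucket_size bins. The sketch below mimics the idea with an explicit string join; the real op uses its own fingerprinting, so the exact IDs will differ:

# Sketch of the cross-and-hash idea (not the actual internal fingerprinting).
education_vals = tf.constant(['Bachelors', 'Bachelors', 'Masters'])
occupation_vals = tf.constant(['Exec-managerial', 'Craft-repair', 'Sales'])

crossed_keys = tf.strings.join([education_vals, occupation_vals], separator='_X_')
print(tf.strings.to_hash_bucket_fast(crossed_keys, num_buckets=1000).numpy())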

For a visual example of the effect of hash buckets with crossed columns, see this notebook.

Define the logistic regression model

After processing the input data and defining all the feature columns, we can put them together and build a logistic regression model. The previous section showed several types of base and derived feature columns, including:

  • CategoricalColumn
  • NumericColumn
  • BucketizedColumn
  • CrossedColumn

All of these are subclasses of the abstract FeatureColumn class and can be added to the feature_columns field of a model:

import tempfile

base_columns = [
    education, marital_status, relationship, workclass, occupation,
    age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
]

model = tf.estimator.LinearClassifier(
    model_dir=tempfile.mkdtemp(),
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(learning_rate=0.1))

The model automatically learns a bias term, which controls the prediction made without observing any features. The learned model files are stored in model_dir.

Train and evaluate the model

After adding all the features to the model, let's train the model. Training a model is just a single command using the tf.estimator API:

train_inpf = functools.partial(census_dataset.input_fn, train_file,
                               num_epochs=40, shuffle=True, batch_size=64)

model.train(train_inpf)

clear_output()  # used for notebook display
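Once training has written a checkpoint, the learned variables (including the bias term mentioned earlier) can be read back from the estimator. The variable name below is the one canned linear estimators typically use; treat it as an assumption and check model.get_variable_names() if it differs:

# Read back the learned bias term from the latest checkpoint.
print([name for name in model.get_variable_names() if 'bias' in name])
print(model.get_variable_value('linear/linear_model/bias_weights'))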

After the model is trained, evaluate the accuracy of the model by predicting the labels of the holdout data:

results = model.evaluate(test_inpf)

clear_output()

for key,value in sorted(results.items()):
  print('%s: %0.2f' % (key, value))
accuracy: 0.83
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.69
average_loss: 0.35
global_step: 20351.00
label/mean: 0.24
loss: 22.64
precision: 0.68
prediction/mean: 0.24
recall: 0.56

The first line of the output should display something like accuracy: 0.83, which means the accuracy is 83%. You can try using more features and transformations to see if you can do better!

After the model is evaluated, we can use it to predict whether an individual has an annual income of over 50,000 dollars given that individual's information.

Let's look in more detail at how the model performed:

import numpy as np

predict_df = test_df[:20].copy()

pred_iter = model.predict(
    lambda: easy_input_function(predict_df, label_key='income_bracket',
                                num_epochs=1, shuffle=False, batch_size=10))

classes = np.array(['<=50K', '>50K'])
pred_class_id = []

for pred_dict in pred_iter:
  pred_class_id.append(pred_dict['class_ids'])

predict_df['predicted_class'] = classes[np.array(pred_class_id)]
predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']

clear_output()

predict_df[['income_bracket','predicted_class', 'correct']]

For a working end-to-end example, download our example code and set the model_type flag to wide.

Adding Regularization to Prevent Overfitting

Regularization is a technique used to avoid overfitting. Overfitting happens when a model performs well on the data it was trained on, but worse on test data that the model has not seen before. Overfitting can occur when a model is excessively complex, such as having too many parameters relative to the amount of observed training data. Regularization allows you to control the model's complexity and makes the model more generalizable to unseen data.
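Schematically, L1/L2 regularization adds weight penalties to the training objective:

$$\min_w \; \text{loss}(w) + \lambda_1 \lVert w \rVert_1 + \frac{\lambda_2}{2} \lVert w \rVert_2^2$$

A larger L1 strength drives more weights to exactly zero (sparsity), while a larger L2 strength shrinks large weights without typically zeroing them. In FtrlOptimizer these strengths correspond to the l1_regularization_strength and l2_regularization_strength arguments, applied per-coordinate inside the FTRL update rather than as a literal extra loss term, so the 1/2 factor above is schematic.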

You can add L1 and L2 regularizations to the model with the following code:

model_l1 = tf.estimator.LinearClassifier(
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=10.0,
        l2_regularization_strength=0.0))

model_l1.train(train_inpf)

results = model_l1.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))
accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.69
average_loss: 0.35
global_step: 20351.00
label/mean: 0.24
loss: 22.47
precision: 0.69
prediction/mean: 0.24
recall: 0.55
model_l2 = tf.estimator.LinearClassifier(
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.0,
        l2_regularization_strength=10.0))

model_l2.train(train_inpf)

results = model_l2.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))
accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.69
average_loss: 0.35
global_step: 20351.00
label/mean: 0.24
loss: 22.46
precision: 0.69
prediction/mean: 0.24
recall: 0.55

These regularized models don't perform much better than the base model. Let's look at the model's weight distributions to better see the effect of the regularization:

def get_flat_weights(model):
  weight_names = [
      name for name in model.get_variable_names()
      if "linear_model" in name and "Ftrl" not in name]

  weight_values = [model.get_variable_value(name) for name in weight_names]

  weights_flat = np.concatenate([item.flatten() for item in weight_values], axis=0)

  return weights_flat

weights_flat = get_flat_weights(model)
weights_flat_l1 = get_flat_weights(model_l1)
weights_flat_l2 = get_flat_weights(model_l2)

The models have many zero-valued weights caused by unused hash bins (there are many more hash bins than categories in some columns). We can mask these weights when viewing the weight distributions:

weight_mask = weights_flat != 0

weights_base = weights_flat[weight_mask]
weights_l1 = weights_flat_l1[weight_mask]
weights_l2 = weights_flat_l2[weight_mask]

Now plot the distributions:

plt.figure()
_ = plt.hist(weights_base, bins=np.linspace(-3,3,30))
plt.title('Base Model')
plt.ylim([0,500])

plt.figure()
_ = plt.hist(weights_l1, bins=np.linspace(-3,3,30))
plt.title('L1 - Regularization')
plt.ylim([0,500])

plt.figure()
_ = plt.hist(weights_l2, bins=np.linspace(-3,3,30))
plt.title('L2 - Regularization')
_=plt.ylim([0,500])

[Three histograms of the nonzero model weights, titled "Base Model", "L1 - Regularization", and "L2 - Regularization".]

Both types of regularization squeeze the distribution of weights towards zero. L2 regularization has a greater effect in the tails of the distribution, eliminating extreme weights. L1 regularization produces more exactly-zero values; in this case, it sets ~200 weights to zero.
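To put a number on that last claim, count the exact zeros among the weights the base model actually used (a quick check built on the masked arrays above; the counts will vary from run to run):

# Count how many of the base model's nonzero weights each regularized
# model drove to exactly zero.
print('L1 zeros:', np.sum(weights_l1 == 0))
print('L2 zeros:', np.sum(weights_l2 == 0))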