TensorFlow確率のオプティマイザー

TensorFlow.orgで表示 GoogleColabで実行 GitHubでソースを表示 ノートブックをダウンロードする

概要

このコラボでは、TensorFlowProbabilityに実装されているさまざまなオプティマイザーの使用方法を示します。

依存関係と前提条件

輸入

BFGSおよびL-BFGSオプティマイザー

準ニュートン法は、一般的な1次最適化アルゴリズムのクラスです。これらの方法では、正確なヘッセ行列の正定値近似を使用して、検索方向を見つけます。

Broydenフレッチャー-ゴールドファーブ-Shannoアルゴリズム( BFGSが)、この概念の特定の実装です。これは適用可能であり、勾配がどこでも連続している中規模の問題(たとえば、$ L_2 $ペナルティを伴う線形回帰)に最適な方法です。

L-BFGSは、そのヘッセ行列妥当なコストで計算されていないか、疎でないことができ、より大きな問題を解決するのに有用であるBFGSの限られたメモリのバージョンです。ヘッセ行列の完全に密な$ n \ times n $近似を格納する代わりに、これらの近似を暗黙的に表す長さ$ n $のいくつかのベクトルのみを保存します。

ヘルパー関数

単純な2次関数のL-BFGS

# Fix numpy seed for reproducibility
np.random.seed(12345)

# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.

# The `make_val_and_grad_fn` decorator helps transforming a function
# returning the objective value into one that returns both the gradient
# and the value. It also works for both eager and graph mode.

dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))

@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)

# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)

# Finally an optional argument called tolerance let's you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10

@tf.function
def quadratic_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
    quadratic,
    initial_position=tf.constant(start),
    tolerance=tolerance)

results = run(quadratic_with_lbfgs)

# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.

print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.014586 seconds
L-BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10

BFGSと同じ問題

@tf.function
def quadratic_with_bfgs():
  return tfp.optimizer.bfgs_minimize(
    quadratic,
    initial_position=tf.constant(start),
    tolerance=tolerance)

results = run(quadratic_with_bfgs)

print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.010468 seconds
BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10

L1ペナルティを伴う線形回帰:前立腺がんデータ

統計的学習、データマイニング、推論、およびトレバー・ハスティー、ロバート・ティブシャーアーニとジェローム・フリードマンによる予測の要素:ブックから例。

これはL1ペナルティを伴う最適化問題であることに注意してください。

データセットを取得する

def cache_or_download_file(cache_dir, url_base, filename):
  """Read a cached file or download it."""
  filepath = os.path.join(cache_dir, filename)
  if tf.io.gfile.exists(filepath):
    return filepath
  if not tf.io.gfile.exists(cache_dir):
    tf.io.gfile.makedirs(cache_dir)
  url = url_base + filename
  print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
  urllib.request.urlretrieve(url, filepath)
  return filepath

def get_prostate_dataset(cache_dir=CACHE_DIR):
  """Download the prostate dataset and read as Pandas dataframe."""
  url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
  return pd.read_csv(
      cache_or_download_file(cache_dir, url_base, 'prostate.data'),
      delim_whitespace=True, index_col=0)

prostate_df = get_prostate_dataset()
Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.

問題の定義

np.random.seed(12345)

feature_names = ['lcavol', 'lweight',   'age',  'lbph', 'svi', 'lcp',   
                 'gleason', 'pgg45']

# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
    scalar.fit_transform(
        prostate_df[feature_names].astype('float64')))

# select training set
prostate_df_train = prostate_df[prostate_df.train == 'T'] 

# Select features and labels 
features = prostate_df_train[feature_names]
labels =  prostate_df_train[['lpsa']]

# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)

dtype = feat.dtype

regularization = 0 # regularization parameter
dim = 8 # number of features

# We pick a random starting point for the search
start = np.random.randn(dim + 1)

def regression_loss(params):
  """Compute loss for linear regression model with L1 penalty

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The mean square error loss including L1 penalty.
  """
  params = tf.squeeze(params)
  intercept, beta  = params[0], params[1:]
  pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
  mse_loss = tf.reduce_sum(
      tf.cast(
        tf.losses.mean_squared_error(y_true=lab, y_pred=pred), tf.float64))
  l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
  total_loss = mse_loss + l1_penalty
  return total_loss

L-BFGSで解く

L-BFGSを使用してフィットします。 L1ペナルティは導関数の不連続性をもたらしますが、実際には、L-BFGSは依然として非常にうまく機能します。

@tf.function
def l1_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
    make_val_and_grad_fn(regression_loss),
    initial_position=tf.constant(start),
    tolerance=1e-8)

results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta:      Fitted {}'.format(fitted_beta))
Evaluation took: 0.017987 seconds
L-BFGS Results
Converged: True
Intercept: Fitted (2.3879985744556484)
Beta:      Fitted [ 0.68626215  0.28193532 -0.17030254  0.10799274  0.33634988 -0.24888523
  0.11992237  0.08689026]

ネルダーミードで解く

ネルダミードの方法は、最も人気のある派生無料の最小化法の一つです。このオプティマイザーは勾配情報を使用せず、ターゲット関数の微分可能性についての仮定を行いません。したがって、L1ペナルティを伴う最適化問題などの滑らかでない目的関数に適しています。

$ n $次元の最適化問題では、非縮退シンプレックスにまたがる一連の$ n + 1 $候補解を維持します。各頂点の関数値を使用して、一連の移動(反射、拡張、収縮、収縮)に基づいてシンプレックスを連続的に変更します。

# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)

@tf.function
def l1_regression_with_nelder_mead():
  return tfp.optimizer.nelder_mead_minimize(
      regression_loss,
      initial_vertex=initial_vertex,
      func_tolerance=1e-10,
      position_tolerance=1e-10)

results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta:      Fitted {}'.format(fitted_beta))
Evaluation took: 0.325643 seconds
Nelder Mead Results
Converged: True
Intercept: Fitted (2.387998456121595)
Beta:      Fitted [ 0.68626266  0.28193456 -0.17030291  0.10799375  0.33635132 -0.24888703
  0.11992244  0.08689023]

L2ペナルティを伴うロジスティック回帰

この例では、分類用の合成データセットを作成し、L-BFGSオプティマイザーを使用してパラメーターを適合させます。

np.random.seed(12345)

dim = 5  # The number of features
n_obs = 10000  # The number of observations

betas = np.random.randn(dim)  # The true beta
intercept = np.random.randn()  # The true intercept

features = np.random.randn(n_obs, dim)  # The feature matrix
probs = sp.special.expit(
    np.matmul(features, np.expand_dims(betas, -1)) + intercept)

labels = sp.stats.bernoulli.rvs(probs)  # The true labels

regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)

@make_val_and_grad_fn
def negative_log_likelihood(params):
  """Negative log likelihood for logistic model with L2 penalty

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The negative log likelihood plus the penalty term. 
  """
  intercept, beta  = params[0], params[1:]
  logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
  log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
      labels=lab, logits=logit))
  l2_penalty = regularization * tf.reduce_sum(beta ** 2)
  total_loss = log_likelihood + l2_penalty
  return total_loss

start = np.random.randn(dim + 1)

@tf.function
def l2_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      negative_log_likelihood,
      initial_position=tf.constant(start),
      tolerance=1e-8)

results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))
Evaluation took: 0.056751 seconds
Converged: True
Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904)
Beta:
    Fitted [-0.18016612  0.53121578 -0.56420632 -0.5336374   2.00499675],
    Actual [-0.20470766  0.47894334 -0.51943872 -0.5557303   1.96578057]

バッチ処理のサポート

BFGSとL-BFGSはどちらも、たとえば、多くの異なる開始点から単一の関数を最適化するために、バッチ計算をサポートしています。または単一のポイントからの複数のパラメトリック関数。

単一の機能、複数の開始点

Himmelblauの関数は、標準の最適化テストケースです。関数は次のように与えられます。

$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$$

この関数には、次の場所に4つの最小値があります。

  • (3、2)、
  • (-2.805118、3.131312)、
  • (-3.779310、-3.283186)、
  • (3.584428、-1.848126)。

これらの最小値はすべて、適切な開始点から到達できます。

# The function to minimize must take as input a tensor of shape [..., n]. In
# this n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.

@make_val_and_grad_fn
def himmelblau(coord):
  x, y = coord[..., 0], coord[..., 1]
  return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2

starts = tf.constant([[1, 1],
                      [-2, 2],
                      [-1, -1],
                      [1, -2]], dtype='float64')

# The stopping_condition allows to further specify when should the search stop.
# The default, tfp.optimizer.converged_all, will proceed until all points have
# either converged or failed. There is also a tfp.optimizer.converged_any to
# stop as soon as the first point converges, or all have failed.

@tf.function
def batch_multiple_starts():
  return tfp.optimizer.lbfgs_minimize(
      himmelblau, initial_position=starts,
      stopping_condition=tfp.optimizer.converged_all,
      tolerance=1e-8)

results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)
Evaluation took: 0.019095 seconds
Converged: [ True  True  True  True]
Minima: [[ 3.          2.        ]
 [-2.80511809  3.13131252]
 [-3.77931025 -3.28318599]
 [ 3.58442834 -1.84812653]]

複数の機能

デモンストレーションの目的で、この例では、ランダムに生成された多数の高次元の2次ボウルを同時に最適化します。

np.random.seed(12345)

dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))

@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(input_tensor=scales * (x - minimum)**2, axis=-1)

# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')

@tf.function
def batch_multiple_functions():
  return tfp.optimizer.lbfgs_minimize(
      quadratic, initial_position=start,
      stopping_condition=tfp.optimizer.converged_all,
      max_iterations=100,
      tolerance=1e-8)

results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))
Evaluation took: 1.994132 seconds
All converged: True
Largest error: 4.4131473142527966e-08