TensorFlow Probability 0.10 : Guide : Optimizer (translation/commentary)
Translation: ClassCat, Inc. Sales Information
Created: 06/19/2020 (0.10.0)
* This page is a translation of the corresponding TensorFlow Probability documentation, with supplementary explanations added where appropriate.
* The sample code has been verified to run; it has been adjusted or extended where necessary.
Guide : Optimizer
Abstract
This colab demonstrates how to use the various optimizers implemented in TensorFlow Probability.
Dependencies & prerequisites
#@title Import { display-mode: "form" }
%matplotlib inline

import contextlib
import functools
import os
import time

import numpy as np
import pandas as pd
import scipy as sp

from six.moves import urllib
from sklearn import preprocessing

import tensorflow.compat.v2 as tf
tf.enable_v2_behavior()

import tensorflow_probability as tfp
BFGS and L-BFGS Optimizers

Quasi-Newton methods are a class of popular first-order optimization algorithms. These methods use a positive-definite approximation to the exact Hessian to find the search direction.

The Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS) is a specific implementation of this general idea. It is applicable to, and is the method of choice for, medium-sized problems where the gradient is continuous everywhere (e.g. linear regression with an $L_2$ penalty).

L-BFGS is a limited-memory version of BFGS that is useful for solving larger problems whose Hessian matrices cannot be computed at a reasonable cost or are not sparse. Instead of storing a fully dense $n \times n$ approximation of the Hessian, it saves only a few vectors of length $n$ that represent the approximation implicitly.
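To make the memory trade-off concrete, the short sketch below minimizes a high-dimensional quadratic bowl while keeping only a handful of correction pairs. This is a minimal illustration, not part of the original guide; it assumes the `num_correction_pairs` argument of `tfp.optimizer.lbfgs_minimize` (which controls how many update vectors are retained), and the objective and dimension are made up for this example.

# Illustrative only (not part of the original guide): a 1000-dimensional
# quadratic bowl minimized with a small limited-memory budget. L-BFGS never
# forms the 1000 x 1000 Hessian approximation; it keeps only
# `num_correction_pairs` pairs of length-1000 vectors.
big_dim = 1000
target = tf.constant(np.random.randn(big_dim))  # minimizer of the bowl

def big_quadratic(x):
  return tfp.math.value_and_gradient(
      lambda z: tf.reduce_sum((z - target) ** 2, axis=-1), x)

sketch_result = tfp.optimizer.lbfgs_minimize(
    big_quadratic,
    initial_position=tf.zeros([big_dim], dtype=tf.float64),
    num_correction_pairs=5,  # assumed argument name; the documented default is 10
    tolerance=1e-8)
print(sketch_result.converged.numpy())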
#@title Helper functions

CACHE_DIR = os.path.join(os.sep, 'tmp', 'datasets')


def make_val_and_grad_fn(value_fn):
  @functools.wraps(value_fn)
  def val_and_grad(x):
    return tfp.math.value_and_gradient(value_fn, x)
  return val_and_grad


@contextlib.contextmanager
def timed_execution():
  t0 = time.time()
  yield
  dt = time.time() - t0
  print('Evaluation took: %f seconds' % dt)


def np_value(tensor):
  """Get numpy value out of possibly nested tuple of tensors."""
  if isinstance(tensor, tuple):
    return type(tensor)(*(np_value(t) for t in tensor))
  else:
    return tensor.numpy()


def run(optimizer):
  """Run an optimizer and measure its evaluation time."""
  optimizer()  # Warmup.
  with timed_execution():
    result = optimizer()
  return np_value(result)
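As a quick sanity check of the helper above (this snippet is not in the original guide), a function wrapped with `make_val_and_grad_fn` returns a `(value, gradient)` pair, which is exactly the calling convention the TFP optimizers below expect:

# Minimal usage example of the decorator defined above.
@make_val_and_grad_fn
def sum_of_squares(x):
  return tf.reduce_sum(x ** 2, axis=-1)

value, grad = sum_of_squares(tf.constant([1., 2.], dtype=tf.float64))
print(value.numpy())  # 5.0
print(grad.numpy())   # [2. 4.]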
L-BFGS on a simple quadratic function
# Fix numpy seed for reproducibility
np.random.seed(12345)

# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.

# The `make_val_and_grad_fn` decorator helps transform a function
# returning the objective value into one that returns both the value
# and the gradient. It also works for both eager and graph mode.

dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))


@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)

# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)

# Finally an optional argument called `tolerance` lets you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10


@tf.function
def quadratic_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      quadratic,
      initial_position=tf.constant(start),
      tolerance=tolerance)

results = run(quadratic_with_lbfgs)

# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.
print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.014586 seconds
L-BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
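Besides `converged`, `position` and `num_iterations`, the returned results object carries further diagnostics. The field names below reflect our reading of the current `tfp.optimizer` results tuple; check the API docs for your TFP version:

# Additional diagnostics on the same `results` object (field names assumed
# from the tfp.optimizer API; not printed in the original guide).
print('Failed:', results.failed)  # True if a line search failed to make progress
print('Objective value at the minimum:', results.objective_value)
print('Number of objective evaluations:', results.num_objective_evaluations)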
The same problem with BFGS
@tf.function
def quadratic_with_bfgs():
  return tfp.optimizer.bfgs_minimize(
      quadratic,
      initial_position=tf.constant(start),
      tolerance=tolerance)

results = run(quadratic_with_bfgs)

print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.010468 seconds
BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
Linear regression with L1 penalty: prostate cancer data

Example from the book: The Elements of Statistical Learning, Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

Note that this is an optimization problem with an L1 penalty.
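Written out explicitly, the objective coded below (with intercept $\beta_0$, coefficients $\beta$, and the `regularization` weight $\lambda$) is

$$\text{loss}(\beta_0, \beta) = \sum_{i=1}^{N} \big(y_i - \beta_0 - x_i^\top \beta\big)^2 + \lambda \sum_{j=1}^{8} |\beta_j|,$$

where the second term is the non-smooth L1 penalty (the run below sets $\lambda = 0$).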
Get the dataset
def cache_or_download_file(cache_dir, url_base, filename):
  """Read a cached file or download it."""
  filepath = os.path.join(cache_dir, filename)
  if tf.io.gfile.exists(filepath):
    return filepath
  if not tf.io.gfile.exists(cache_dir):
    tf.io.gfile.makedirs(cache_dir)
  url = url_base + filename
  print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
  urllib.request.urlretrieve(url, filepath)
  return filepath


def get_prostate_dataset(cache_dir=CACHE_DIR):
  """Download the prostate dataset and read as Pandas dataframe."""
  url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
  return pd.read_csv(
      cache_or_download_file(cache_dir, url_base, 'prostate.data'),
      delim_whitespace=True,
      index_col=0)

prostate_df = get_prostate_dataset()
Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.
Problem definition
np.random.seed(12345)

feature_names = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp',
                 'gleason', 'pgg45']

# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
    scalar.fit_transform(
        prostate_df[feature_names].astype('float64')))

# select training set
prostate_df_train = prostate_df[prostate_df.train == 'T']

# Select features and labels
features = prostate_df_train[feature_names]
labels = prostate_df_train[['lpsa']]

# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)

dtype = feat.dtype

regularization = 0  # regularization parameter
dim = 8  # number of features

# We pick a random starting point for the search
start = np.random.randn(dim + 1)


def regression_loss(params):
  """Compute loss for linear regression model with L1 penalty

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The mean square error loss including L1 penalty.
  """
  params = tf.squeeze(params)
  intercept, beta = params[0], params[1:]
  pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
  mse_loss = tf.reduce_sum(
      tf.cast(
          tf.losses.mean_squared_error(y_true=lab, y_pred=pred), tf.float64))
  l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
  total_loss = mse_loss + l1_penalty
  return total_loss
Solving with L-BFGS

We fit using L-BFGS. Even though the L1 penalty introduces discontinuities in the derivative, in practice L-BFGS still works quite well.
@tf.function
def l1_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      make_val_and_grad_fn(regression_loss),
      initial_position=tf.constant(start),
      tolerance=1e-8)

results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.017987 seconds
L-BFGS Results
Converged: True
Intercept: Fitted (2.3879985744556484)
Beta: Fitted [ 0.68626215  0.28193532 -0.17030254  0.10799274  0.33634988 -0.24888523  0.11992237  0.08689026]
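The run above uses `regularization = 0`, so the penalty is inactive. As a hypothetical variation (not part of the original guide), the sketch below adds a separate L1 term with weight 1.0 to the same loss and refits; a nonzero penalty typically shrinks the fitted coefficients toward zero, with exact values depending on the chosen weight.

# Hypothetical variation: the same regression refit with an explicit L1 term.
# We wrap `regression_loss` rather than changing the global `regularization`,
# so the rest of the notebook is unaffected.
def penalized_regression_loss(params, l1_weight=1.0):
  beta = tf.squeeze(params)[1:]
  return regression_loss(params) + l1_weight * tf.reduce_sum(tf.abs(beta))

@tf.function
def l1_regression_with_penalty():
  return tfp.optimizer.lbfgs_minimize(
      make_val_and_grad_fn(penalized_regression_loss),
      initial_position=tf.constant(start),
      tolerance=1e-8)

penalized = run(l1_regression_with_penalty)
print('Converged:', penalized.converged)
print('Beta with L1 penalty:', penalized.position[1:])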
Solving with Nelder Mead

The Nelder Mead method is one of the most popular derivative-free minimization methods. This optimizer doesn't use gradient information and makes no assumptions about the differentiability of the target function; it is therefore suitable for non-smooth objective functions, for example optimization problems with an L1 penalty.

For an $n$-dimensional optimization problem it maintains a set of $n+1$ candidate solutions that span a non-degenerate simplex. It then successively modifies the simplex through a set of moves (reflection, expansion, shrinkage and contraction), using the function values at each of the vertices.
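To make one of these moves concrete, here is a minimal NumPy sketch of a single reflection step, purely for illustration; `tfp.optimizer.nelder_mead_minimize`, used below, implements the full algorithm with all four moves:

# Illustrative only: one Nelder-Mead reflection step on a toy 2-D problem.
def reflect_worst_vertex(simplex, f):
  """Mirror the worst vertex through the centroid of the remaining
  vertices, keeping the reflected point if it improves."""
  values = np.array([f(v) for v in simplex])
  worst = np.argmax(values)
  centroid = np.delete(simplex, worst, axis=0).mean(axis=0)
  reflected = centroid + (centroid - simplex[worst])
  if f(reflected) < values[worst]:
    simplex = simplex.copy()
    simplex[worst] = reflected
  return simplex

toy_simplex = np.array([[0., 0.], [1., 0.], [0., 1.]])   # 3 vertices in 2-D
toy_f = lambda p: np.sum((p - np.array([2., 2.])) ** 2)  # simple quadratic bowl
print(reflect_worst_vertex(toy_simplex, toy_f))  # worst vertex [0, 0] moves to [1, 1]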
# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)

@tf.function
def l1_regression_with_nelder_mead():
  return tfp.optimizer.nelder_mead_minimize(
      regression_loss,
      initial_vertex=initial_vertex,
      func_tolerance=1e-10,
      position_tolerance=1e-10)

results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.325643 seconds
Nelder Mead Results
Converged: True
Intercept: Fitted (2.387998456121595)
Beta: Fitted [ 0.68626266  0.28193456 -0.17030291  0.10799375  0.33635132 -0.24888703  0.11992244  0.08689023]
Logistic regression with L2 penalty

For this example we create a synthetic data set for classification and use the L-BFGS optimizer to fit the parameters of the model.
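In terms of the logits $z_i = \beta_0 + x_i^\top \beta$ and binary labels $y_i \in \{0, 1\}$, the objective coded below (`tf.nn.sigmoid_cross_entropy_with_logits` summed over observations, plus an L2 term with weight $\lambda$, the `regularization` variable) is the penalized negative log-likelihood

$$\text{loss}(\beta_0, \beta) = \sum_{i=1}^{N} \big[\log\!\big(1 + e^{z_i}\big) - y_i z_i\big] + \lambda \sum_{j=1}^{d} \beta_j^2.$$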
np.random.seed(12345)

dim = 5  # The number of features
n_obs = 10000  # The number of observations

betas = np.random.randn(dim)  # The true beta
intercept = np.random.randn()  # The true intercept

features = np.random.randn(n_obs, dim)  # The feature matrix
probs = sp.special.expit(
    np.matmul(features, np.expand_dims(betas, -1)) + intercept)

labels = sp.stats.bernoulli.rvs(probs)  # The true labels

regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)


@make_val_and_grad_fn
def negative_log_likelihood(params):
  """Negative log likelihood for logistic model with L2 penalty

  Args:
    params: A real tensor of shape [dim + 1]. The zeroth component
      is the intercept term and the rest of the components are the
      beta coefficients.

  Returns:
    The negative log likelihood plus the penalty term.
  """
  intercept, beta = params[0], params[1:]
  logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
  log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
      labels=lab, logits=logit))
  l2_penalty = regularization * tf.reduce_sum(beta ** 2)
  total_loss = log_likelihood + l2_penalty
  return total_loss

start = np.random.randn(dim + 1)


@tf.function
def l2_regression_with_lbfgs():
  return tfp.optimizer.lbfgs_minimize(
      negative_log_likelihood,
      initial_position=tf.constant(start),
      tolerance=1e-8)

results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]

print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))
Evaluation took: 0.056751 seconds
Converged: True
Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904)
Beta:
	Fitted [-0.18016612  0.53121578 -0.56420632 -0.5336374   2.00499675],
	Actual [-0.20470766  0.47894334 -0.51943872 -0.5557303   1.96578057]
Batching support

Both BFGS and L-BFGS support batched computation, for example to optimize a single function from many different starting points, or multiple parametric functions from a single point.

Single function, multiple starting points

The Himmelblau function is a standard optimization test case. The function is given by:
$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$$
The function has four minima located at:
- (3, 2),
- (-2.805118, 3.131312),
- (-3.779310, -3.283186),
- (3.584428, -1.848126).
All of these minima may be reached from appropriate starting points.
# The function to minimize must take as input a tensor of shape [..., n]. In
# this case n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.
@make_val_and_grad_fn
def himmelblau(coord):
  x, y = coord[..., 0], coord[..., 1]
  return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2

starts = tf.constant([[1, 1],
                      [-2, 2],
                      [-1, -1],
                      [1, -2]], dtype='float64')

# The stopping_condition allows us to further specify when the search should
# stop. The default, tfp.optimizer.converged_all, will proceed until all points
# have either converged or failed. There is also a tfp.optimizer.converged_any
# to stop as soon as the first point converges, or all have failed.
@tf.function
def batch_multiple_starts():
  return tfp.optimizer.lbfgs_minimize(
      himmelblau, initial_position=starts,
      stopping_condition=tfp.optimizer.converged_all,
      tolerance=1e-8)

results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)
Evaluation took: 0.019095 seconds
Converged: [ True  True  True  True]
Minima: [[ 3.          2.        ]
 [-2.80511809  3.13131252]
 [-3.77931025 -3.28318599]
 [ 3.58442834 -1.84812653]]
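As an optional sanity check (not in the original guide), the recovered positions can be compared against the known minima listed above; the batch order follows the order of the starting points:

# The four starting points above were chosen so that each converges to a
# different one of the four known minima, in this order.
known_minima = np.array([[3.0, 2.0],
                         [-2.805118, 3.131312],
                         [-3.779310, -3.283186],
                         [3.584428, -1.848126]])
print(np.allclose(results.position, known_minima, atol=1e-5))  # expect True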
Multiple functions

For demonstration purposes, in this example we simultaneously optimize a large number of high-dimensional, randomly generated quadratic bowls.
np.random.seed(12345)

dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))


@make_val_and_grad_fn
def quadratic(x):
  return tf.reduce_sum(input_tensor=scales * (x - minimum) ** 2, axis=-1)

# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')


@tf.function
def batch_multiple_functions():
  return tfp.optimizer.lbfgs_minimize(
      quadratic, initial_position=start,
      stopping_condition=tfp.optimizer.converged_all,
      max_iterations=100,
      tolerance=1e-8)

results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))
Evaluation took: 1.994132 seconds
All converged: True
Largest error: 4.4131473142527966e-08
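If only one solution from the batch is needed, the search can be stopped as soon as the first batch member converges by passing `stopping_condition=tfp.optimizer.converged_any`, as mentioned in the comment earlier. A minimal sketch (not part of the original guide) reusing the batched quadratic defined above:

# Hypothetical variation: stop as soon as any one of the 500 problems has
# converged (or all have failed), rather than waiting for all of them.
@tf.function
def batch_early_stop():
  return tfp.optimizer.lbfgs_minimize(
      quadratic, initial_position=start,
      stopping_condition=tfp.optimizer.converged_any,
      max_iterations=100,
      tolerance=1e-8)

early_results = run(batch_early_stop)
print('Number converged:', np.sum(early_results.converged))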