TensorFlow Model Training and Optimization

A deeper look at the model training process, covering advanced techniques such as loss function selection, optimizer configuration, learning rate scheduling, and overfitting prevention.

1. Choosing a Loss Function

The loss function is the core of model training: it measures the difference between the model's predictions and the true values. Choosing an appropriate loss function is critical to successful training.

1.1 Loss Functions for Regression

Common loss functions for regression tasks include:

1.1.1 Mean Squared Error (MSE)

Mean squared error is the most widely used regression loss. It is the average of the squared differences between predictions and true values:

MSE = (1/n) * Σ(y_pred - y_true)²

# Use the MSE loss function
model.compile(optimizer='adam', loss='mean_squared_error')
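The formula can be verified by hand with NumPy (the arrays below are made-up toy values, purely for illustration):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# MSE = (1/n) * Σ(y_pred - y_true)²
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # ≈ 0.4167
```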

1.1.2 Mean Absolute Error (MAE)

Mean absolute error is the average of the absolute differences between predictions and true values:

MAE = (1/n) * Σ|y_pred - y_true|

# Use the MAE loss function
model.compile(optimizer='adam', loss='mean_absolute_error')

1.1.3 Huber Loss

Huber loss combines MSE and MAE: it behaves like MSE for small errors and like MAE for large errors, which makes it robust to outliers:

# Use the Huber loss function
model.compile(optimizer='adam', loss=tf.keras.losses.Huber(delta=1.0))
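The robustness claim can be checked numerically with a NumPy sketch of the standard Huber formula (the arrays are made-up values with one deliberate outlier):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond it
    err = np.abs(y_pred - y_true)
    return np.mean(np.where(err <= delta,
                            0.5 * err ** 2,
                            delta * (err - 0.5 * delta)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 2.9, 10.0])  # the last point is an outlier

mse = np.mean((y_pred - y_true) ** 2)   # ≈ 9.008, dominated by the outlier
mae = np.mean(np.abs(y_pred - y_true))  # ≈ 1.575
hub = huber(y_true, y_pred)             # ≈ 1.379, grows only linearly with the outlier
```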

1.2 Loss Functions for Classification

Common loss functions for classification tasks include:

1.2.1 Cross-Entropy Loss

Cross-entropy is the most widely used loss for classification; it measures the difference between two probability distributions.

Binary cross-entropy:

# Use the binary cross-entropy loss function
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Multi-class cross-entropy:

# Use the categorical cross-entropy loss function
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Sparse categorical cross-entropy:

When labels are integers rather than one-hot encoded, use sparse categorical cross-entropy:

# Use the sparse categorical cross-entropy loss function
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
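The two label formats compute the same quantity, the negative log-probability of the true class, which a short NumPy check makes explicit (the probabilities are made-up values for illustration):

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # softmax outputs for two samples

sparse_labels = np.array([0, 1])     # integer class indices
one_hot = np.eye(3)[sparse_labels]   # the same labels, one-hot encoded

# Sparse form: index the true-class probability directly
loss_sparse = -np.log(probs[np.arange(len(sparse_labels)), sparse_labels]).mean()
# One-hot form: sum over the one-hot vector, which picks out the same entry
loss_onehot = -(one_hot * np.log(probs)).sum(axis=1).mean()
print(np.isclose(loss_sparse, loss_onehot))  # True
```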

2. Choosing and Configuring an Optimizer

The optimizer is the algorithm that updates the model's parameters: it adjusts them according to the gradient of the loss function so as to minimize that loss.

2.1 Common Optimizers

2.1.1 Stochastic Gradient Descent (SGD)

SGD is the most basic optimizer; it computes the gradient on randomly sampled examples:

# Use the SGD optimizer
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss='mse')

2.1.2 SGD with Momentum

SGD with momentum can accelerate convergence and damp oscillations:

# Use the SGD optimizer with momentum
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), loss='mse')

2.1.3 RMSprop

RMSprop adapts each parameter's learning rate individually:

# Use the RMSprop optimizer
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001), loss='mse')

2.1.4 Adam

Adam combines the advantages of momentum and RMSprop and is currently one of the most widely used optimizers:

# Use the Adam optimizer
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')

2.1.5 AdamW

AdamW is an improved variant of Adam that decouples weight decay from the gradient update and usually yields better results (available as tf.keras.optimizers.AdamW since TensorFlow 2.11):

# Use the AdamW optimizer
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01), loss='mse')

2.2 Tuning Optimizer Parameters

Each optimizer exposes its own parameters, which can be adjusted as needed:

# Customize the Adam optimizer's parameters
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # learning rate
    beta_1=0.9,  # exponential decay rate for the first-moment estimates
    beta_2=0.999,  # exponential decay rate for the second-moment estimates
    epsilon=1e-07,  # small constant that prevents division by zero
    amsgrad=False  # whether to use the AMSGrad variant
)

model.compile(optimizer=optimizer, loss='mse')
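These parameters map directly onto Adam's update rule. A single update step can be sketched in NumPy (the scalar weight and gradient below are toy values, not taken from a real model):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    # Exponential moving averages of the gradient and its square
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g ** 2
    # Bias correction compensates for the zero initialization of m and v
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    # Update scaled by the adaptive per-parameter step size
    w = w - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, g=0.5, m=m, v=v, t=1)
# After bias correction the first step has magnitude ≈ lr
```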

3. Learning Rate Scheduling Strategies

The learning rate is one of the most important hyperparameters: it sets the step size of each parameter update. A good schedule speeds up convergence and improves final model performance.

3.1 Fixed Learning Rate

The simplest strategy is a constant learning rate:

# Use a fixed learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')

3.2 Learning Rate Schedulers

TensorFlow provides several ways to adjust the learning rate dynamically during training:

3.2.1 Exponential Decay

# Exponentially decaying learning rate
# Note: LearningRateScheduler calls the schedule as schedule(epoch, lr), and the
# decay is computed from a fixed initial rate rather than the current rate
# (otherwise the decay would compound every epoch)
def exponential_decay(epoch, lr):
    initial_lr = 0.001
    decay_rate = 0.1
    decay_steps = 100
    return initial_lr * (decay_rate ** (epoch / decay_steps))

# Create the learning rate scheduler callback
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay)

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')

# Train the model with the learning rate scheduler
history = model.fit(X, y, epochs=100, callbacks=[lr_scheduler])

3.2.2 Step Decay

# Step-decay learning rate, computed from a fixed initial rate so that
# the drops do not compound from epoch to epoch
def step_decay(epoch, lr):
    initial_lr = 0.01
    drop_rate = 0.5
    epochs_drop = 10
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(step_decay)
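Printing a step schedule computed from a fixed base rate makes the staircase shape visible (the base rate and drop values here are illustrative):

```python
initial_lr = 0.01
drop_rate = 0.5
epochs_drop = 10

# The rate is constant for epochs_drop epochs, then drops by drop_rate
schedule = [initial_lr * (drop_rate ** (epoch // epochs_drop)) for epoch in range(30)]
print(schedule[0], schedule[10], schedule[20])  # 0.01 0.005 0.0025
```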

3.2.3 Cosine Annealing

Cosine annealing decays the learning rate along a cosine curve, so it falls more slowly toward the end of training. Keras has no CosineAnnealingLR callback; use the built-in CosineDecay schedule instead:

# Use a cosine-decay learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=100 * (len(X) // batch_size)  # total number of training steps
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule), loss='mse')

3.2.4 ReduceLROnPlateau

Automatically reduce the learning rate when the validation loss stops improving:

# Use ReduceLROnPlateau
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',  # metric to monitor
    factor=0.1,  # factor by which the learning rate is reduced
    patience=10,  # epochs with no improvement before reducing
    verbose=1,  # print a message when the rate changes
    mode='min',  # 'min' means act when the monitored metric stops decreasing
    min_delta=0.0001,  # minimum change that counts as an improvement
    cooldown=0,  # epochs to wait after a reduction before resuming monitoring
    min_lr=0  # lower bound on the learning rate
)

# Train the model with ReduceLROnPlateau
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[reduce_lr]
)

3.3 Learning Rate Finder

A learning rate finder helps locate a good initial learning rate by sweeping the rate upward over a short run while recording the loss:

# Learning rate finder
class LearningRateFinder:
    def __init__(self, model):
        self.model = model
        self.lrs = []
        self.losses = []
    
    def find(self, X, y, start_lr=1e-6, end_lr=1e-1, epochs=10, batch_size=32):
        # Compile the model at the starting rate
        self.model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=start_lr), loss='mse')
        
        # Total number of training steps
        steps_per_epoch = len(X) // batch_size
        total_steps = epochs * steps_per_epoch
        
        # Multiplicative factor that grows the rate from start_lr to end_lr
        lr_factor = (end_lr / start_lr) ** (1 / total_steps)
        
        # Custom callback: record the loss, then raise the rate after every batch
        class LRFinderCallback(tf.keras.callbacks.Callback):
            def __init__(self, lrf):
                super().__init__()
                self.lrf = lrf
                self.current_lr = start_lr
            
            def on_train_batch_end(self, batch, logs=None):
                # Record the current learning rate and loss
                self.lrf.lrs.append(self.current_lr)
                self.lrf.losses.append(logs['loss'])
                
                # Increase the learning rate
                self.current_lr *= lr_factor
                self.model.optimizer.learning_rate.assign(self.current_lr)
        
        # Train the model
        self.model.fit(
            X, y,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=[LRFinderCallback(self)],
            verbose=0
        )
    
    def plot(self):
        import matplotlib.pyplot as plt
        plt.figure(figsize=(10, 6))
        plt.plot(self.lrs, self.losses)
        plt.xscale('log')
        plt.xlabel('Learning Rate')
        plt.ylabel('Loss')
        plt.title('Learning Rate Finder')
        plt.show()

# Use the learning rate finder
lrf = LearningRateFinder(model)
lrf.find(X, y, start_lr=1e-6, end_lr=1e-1, epochs=10)
lrf.plot()

4. Preventing Overfitting

Overfitting occurs when a model performs well on the training data but poorly on new data. The following techniques help prevent it:

4.1 Data Augmentation

Data augmentation increases the diversity of the training data and can reduce overfitting:

# Image data augmentation (in TensorFlow 2.6+ these layers live directly under tf.keras.layers)
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.1),
])

# Use data augmentation inside the model
model = tf.keras.Sequential([
    data_augmentation,
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])

4.2 Regularization

Regularization prevents overfitting by adding a penalty term to the loss function.

4.2.1 L1 Regularization

L1 regularization adds the sum of the absolute values of the weights to the loss:

# Use L1 regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l1(0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])

4.2.2 L2 Regularization

L2 regularization adds the sum of the squared weights to the loss:

# Use L2 regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])

4.2.3 Combined L1 and L2 Regularization

L1L2 regularization combines the L1 and L2 penalties:

# Use L1L2 regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])

4.3 Dropout

Dropout randomly deactivates neurons during training, which prevents neurons from co-adapting too strongly:

# Use Dropout
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # Dropout layer: drops 50% of the units during training
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

4.4 Early Stopping

Early stopping halts training when the validation loss stops improving, which prevents overfitting:

# Use early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',  # metric to monitor
    patience=10,  # epochs with no improvement before stopping
    verbose=1,  # print a message when training stops
    mode='min',  # 'min' means stop when the monitored metric stops decreasing
    restore_best_weights=True  # restore the weights from the best epoch
)

# Train the model with early stopping
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stopping]
)

4.5 Batch Normalization

Batch normalization can speed up convergence and has a mild regularizing effect:

# Use batch normalization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])

4.6 Model Ensembles

An ensemble combines the predictions of several models; it usually improves performance and reduces overfitting:

# A simple model ensemble example
# Train several models
models = []
for i in range(5):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), verbose=0)
    models.append(model)

# Ensemble prediction
import numpy as np

def ensemble_predict(models, X):
    predictions = []
    for model in models:
        predictions.append(model.predict(X))
    # Average the predictions across models
    return np.mean(predictions, axis=0)

# Predict with the ensemble
y_pred = ensemble_predict(models, X_test)

5. Training Tips

5.1 Monitoring the Training Process

Use TensorBoard to monitor training:

# Use TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',  # log directory
    histogram_freq=1,  # record histograms every epoch
    write_graph=True,  # record the computation graph
    write_images=True,  # record model weights as images
    update_freq='epoch'  # logging frequency
)

# Train the model with TensorBoard
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[tensorboard_callback]
)

Start TensorBoard:

tensorboard --logdir logs

5.2 Saving the Best Model

Use ModelCheckpoint to save the best model seen during training:

# Save the best model
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.h5',  # where to save the model
    monitor='val_loss',  # metric to monitor
    verbose=1,  # print a message when a checkpoint is saved
    save_best_only=True,  # keep only the best model
    mode='min',  # 'min' means save when the monitored metric reaches a new minimum
    save_weights_only=False,  # save the full model, not just the weights
    save_freq='epoch'  # check at the end of every epoch
)

# Train the model with ModelCheckpoint
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[model_checkpoint]
)

5.3 Gradient Clipping

Gradient clipping helps prevent exploding gradients:

# Clip gradients by value
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)  # clip each gradient element to [-1.0, 1.0]
model.compile(optimizer=optimizer, loss='mse')

Or clip by norm. Note that clipnorm clips each variable's gradient by its own norm; to scale all gradients jointly, use global_clipnorm:

# Clip gradients by their global norm
optimizer = tf.keras.optimizers.Adam(global_clipnorm=1.0)  # scale all gradients so their joint norm is at most 1.0
model.compile(optimizer=optimizer, loss='mse')
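The global-norm computation itself is simple enough to sketch in NumPy (the gradient values are made up for illustration):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Joint L2 norm over all gradient tensors
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale everything by the same factor so the joint norm is <= clip_norm
    scale = min(1.0, clip_norm / global_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm)  # 5.0, so every gradient is scaled by 1/5
```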

5.4 Mixed Precision Training

Mixed precision training mixes 16-bit and 32-bit floating point computation, which can speed up training and reduce memory usage:

# Use mixed precision training
from tensorflow.keras import mixed_precision

# Set the global mixed precision policy
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Build the model (keep the output layer in float32 for numerical stability under mixed precision)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))

6. Distributed Training

Distributed training uses multiple devices or machines to train a model, which can speed up training considerably.

6.1 Data Parallelism

Data parallelism splits the data into batches that are processed in parallel on different devices:

# Use MirroredStrategy for distributed training
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Build the model inside the strategy scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    
    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))

7. Exercises

Exercise 1: Optimize Model Training

  1. Build a simple neural network model.
  2. Try different loss functions and optimizers and compare their results.
  3. Implement a learning rate decay strategy and observe its effect on training.
  4. Add regularization, Dropout, and batch normalization to prevent overfitting.
  5. Use early stopping and ModelCheckpoint to save the best model.

Exercise 2: Optimize an Image Classification Model

  1. Build an image classification model on the MNIST dataset.
  2. Add data augmentation to improve performance.
  3. Try different architectures, such as a CNN or ResNet.
  4. Use the learning rate finder to locate the best learning rate.
  5. Use early stopping and a learning rate scheduler to optimize the training process.

Exercise 3: Prevent Overfitting

  1. Build a complex model that overfits the training data.
  2. Try different overfitting-prevention techniques, such as data augmentation, regularization, and Dropout.
  3. Compare their effects and analyze their pros and cons.
  4. Use TensorBoard to monitor the training of the different models.