1. Choosing a Loss Function
The loss function is the core of model training: it measures the difference between the model's predictions and the true values. Choosing an appropriate loss function is critical for training.
1.1 Loss Functions for Regression
Commonly used loss functions for regression tasks include:
1.1.1 Mean Squared Error (MSE)
Mean squared error is the most commonly used regression loss. It is the mean of the squared differences between the predicted and true values:
MSE = (1/n) * Σ(y_pred - y_true)²
# Use the MSE loss function
model.compile(optimizer='adam', loss='mean_squared_error')
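As a sanity check, the MSE formula above can be computed directly in plain Python (a minimal sketch, independent of TensorFlow; the sample values are made up):

```python
def mse(y_true, y_pred):
    # Mean of the squared differences between predictions and targets
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Errors are 0.5, 0.0 and -1.0, so MSE = (0.25 + 0 + 1) / 3
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # ≈ 0.4167
```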
1.1.2 Mean Absolute Error (MAE)
Mean absolute error is the mean of the absolute differences between the predicted and true values:
MAE = (1/n) * Σ|y_pred - y_true|
# Use the MAE loss function
model.compile(optimizer='adam', loss='mean_absolute_error')
1.1.3 Huber Loss
Huber loss combines MSE and MAE: it behaves like MSE for small errors and like MAE for large errors, which makes it robust to outliers:
# Use the Huber loss function
model.compile(optimizer='adam', loss=tf.keras.losses.Huber(delta=1.0))
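The piecewise behaviour can be sketched in plain Python (a hypothetical helper, not the Keras implementation): errors at or below delta take the quadratic, MSE-like branch, larger errors the linear, MAE-like branch:

```python
def huber(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(p - t)
        if e <= delta:
            total += 0.5 * e ** 2               # quadratic region (like MSE)
        else:
            total += delta * (e - 0.5 * delta)  # linear region (like MAE)
    return total / len(y_true)

print(huber([0.0], [0.5]))  # small error: 0.5 * 0.25 = 0.125
print(huber([0.0], [3.0]))  # large error: 1.0 * (3.0 - 0.5) = 2.5
```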
1.2 Loss Functions for Classification
Commonly used loss functions for classification tasks include:
1.2.1 Cross-Entropy Loss
Cross-entropy is the most commonly used loss for classification. It measures the difference between two probability distributions.
Binary cross-entropy
# Use the binary cross-entropy loss function
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
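What binary cross-entropy computes can be sketched in plain Python (a simplified version of the formula; the `eps` clipping mirrors the numerical-stability trick real implementations use to avoid log(0)):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct predictions give a small loss
print(binary_crossentropy([1, 0], [0.9, 0.1]))  # ≈ 0.105
```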
Categorical cross-entropy
# Use the categorical cross-entropy loss function
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Sparse categorical cross-entropy
When the labels are integers rather than one-hot vectors, use sparse categorical cross-entropy:
# Use the sparse categorical cross-entropy loss function
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
2. Choosing and Configuring an Optimizer
An optimizer is the algorithm that updates the model's parameters: it adjusts them according to the gradient of the loss function in order to minimize the loss.
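The core update rule, w ← w − learning_rate · ∇L(w), can be sketched on a toy one-parameter problem (plain Python; the quadratic loss L(w) = (w − 3)² is a made-up example):

```python
w = 0.0
learning_rate = 0.1
for step in range(100):
    grad = 2 * (w - 3)         # gradient of L(w) = (w - 3)^2
    w -= learning_rate * grad  # gradient-descent update
print(round(w, 4))  # converges to the minimum at w = 3
```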
2.1 Common Optimizers
2.1.1 Stochastic Gradient Descent (SGD)
SGD is the most basic optimizer. It computes gradients on randomly sampled examples:
# Use the SGD optimizer
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss='mse')
2.1.2 SGD with Momentum
SGD with momentum can accelerate convergence and reduce oscillation:
# Use SGD with momentum
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9), loss='mse')
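The momentum update keeps a velocity term that accumulates past gradients; a plain-Python sketch on the same kind of toy quadratic loss (L(w) = (w − 3)², hypothetical values):

```python
w, velocity = 0.0, 0.0
learning_rate, momentum = 0.01, 0.9
for step in range(300):
    grad = 2 * (w - 3)  # gradient of the toy loss
    velocity = momentum * velocity - learning_rate * grad
    w += velocity       # momentum-accelerated step
print(round(w, 3))  # approaches the minimum at w = 3
```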
2.1.3 RMSprop
RMSprop adapts the learning rate of each parameter individually:
# Use the RMSprop optimizer
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001), loss='mse')
2.1.4 Adam
Adam combines the advantages of momentum and RMSprop and is currently one of the most widely used optimizers:
# Use the Adam optimizer
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
2.1.5 AdamW
AdamW is an improved version of Adam that decouples weight decay from the gradient update, and it usually yields better results:
# Use the AdamW optimizer
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01), loss='mse')
2.2 Tuning Optimizer Parameters
Different optimizers expose different parameters, which can be adjusted as needed:
# Customize the Adam optimizer's parameters
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # learning rate
    beta_1=0.9,           # exponential decay rate for the first-moment estimates
    beta_2=0.999,         # exponential decay rate for the second-moment estimates
    epsilon=1e-07,        # small constant that prevents division by zero
    amsgrad=False         # whether to use the AMSGrad variant
)
model.compile(optimizer=optimizer, loss='mse')
3. Learning Rate Schedules
The learning rate is one of the most important hyperparameters: it determines the step size of each parameter update. A good learning rate schedule can speed up convergence and improve model performance.
3.1 Fixed Learning Rate
The simplest strategy is a fixed learning rate:
# Use a fixed learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss='mse')
3.2 Learning Rate Schedulers
TensorFlow provides several learning rate schedulers that adjust the learning rate dynamically:
3.2.1 Exponential Decay
# Exponentially decaying learning rate
initial_lr = 0.001

def exponential_decay(epoch, lr):
    decay_rate = 0.1
    decay_steps = 100
    return initial_lr * (decay_rate ** (epoch / decay_steps))

# Create the learning rate scheduler
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay)
# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=initial_lr), loss='mse')
# Train the model with the scheduler
history = model.fit(X, y, epochs=100, callbacks=[lr_scheduler])
3.2.2 Step Decay
# Step-decayed learning rate
initial_lr = 0.001

def step_decay(epoch, lr):
    drop_rate = 0.5
    epochs_drop = 10
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(step_decay)
3.2.3 Cosine Annealing
Cosine annealing decays the learning rate along a cosine curve, so it decreases more slowly toward the end of training:
# Use a cosine-decay learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=(len(X) // batch_size) * 100  # total number of training steps
)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule), loss='mse')
3.2.4 ReduceLROnPlateau
When the validation loss stops improving, automatically reduce the learning rate:
# Use ReduceLROnPlateau
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',  # metric to monitor
    factor=0.1,          # factor by which the learning rate is reduced
    patience=10,         # epochs without improvement before reducing the learning rate
    verbose=1,           # print a message when the learning rate changes
    mode='min',          # 'min': reduce when the monitored metric stops decreasing
    min_delta=0.0001,    # smallest change that counts as an improvement
    cooldown=0,          # epochs to wait after a reduction before monitoring resumes
    min_lr=0             # lower bound on the learning rate
)
# Train the model with ReduceLROnPlateau
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[reduce_lr]
)
3.3 Learning Rate Finder
A learning rate finder helps locate a suitable initial learning rate:
# Learning rate finder
class LearningRateFinder:
    def __init__(self, model):
        self.model = model
        self.lrs = []
        self.losses = []

    def find(self, X, y, start_lr=1e-6, end_lr=1e-1, epochs=10, batch_size=32):
        # Compile the model
        self.model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=start_lr), loss='mse')
        # Total number of training steps
        steps_per_epoch = len(X) // batch_size
        total_steps = epochs * steps_per_epoch
        # Multiplicative growth factor applied to the learning rate each step
        lr_factor = (end_lr / start_lr) ** (1 / total_steps)

        # Custom callback that records (lr, loss) and raises the lr every batch
        class LRFinderCallback(tf.keras.callbacks.Callback):
            def __init__(self, lrf):
                self.lrf = lrf
                self.current_lr = start_lr

            def on_train_batch_end(self, batch, logs=None):
                # Record the current learning rate and loss
                self.lrf.lrs.append(self.current_lr)
                self.lrf.losses.append(logs['loss'])
                # Increase the learning rate
                self.current_lr *= lr_factor
                self.model.optimizer.learning_rate.assign(self.current_lr)

        # Train the model
        self.model.fit(
            X, y,
            epochs=epochs,
            batch_size=batch_size,
            callbacks=[LRFinderCallback(self)],
            verbose=0
        )

    def plot(self):
        import matplotlib.pyplot as plt
        plt.figure(figsize=(10, 6))
        plt.plot(self.lrs, self.losses)
        plt.xscale('log')
        plt.xlabel('Learning Rate')
        plt.ylabel('Loss')
        plt.title('Learning Rate Finder')
        plt.show()

# Use the learning rate finder
lrf = LearningRateFinder(model)
lrf.find(X, y, start_lr=1e-6, end_lr=1e-1, epochs=10)
lrf.plot()
4. Preventing Overfitting
Overfitting occurs when a model performs well on the training data but poorly on new data. The following techniques help prevent it:
4.1 Data Augmentation
Data augmentation increases the diversity of the training data and can reduce overfitting:
# Image data augmentation
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.1),
])
# Apply data augmentation inside the model
model = tf.keras.Sequential([
    data_augmentation,
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])
4.2 Regularization
Regularization prevents overfitting by adding a penalty term to the loss function.
4.2.1 L1 Regularization
L1 regularization adds the sum of the absolute values of the weights to the loss:
# Use L1 regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l1(0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])
4.2.2 L2 Regularization
L2 regularization adds the sum of the squared weights to the loss:
# Use L2 regularization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])
4.2.3 Combined L1/L2 Regularization
L1L2 regularization combines the L1 and L2 penalties:
# Use L1 and L2 regularization together
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.01, l2=0.01)),
    tf.keras.layers.Dense(10, activation='softmax')
])
4.3 Dropout
Dropout randomly drops neurons during training, which prevents neurons from depending too strongly on one another:
# Use Dropout
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # drop 50% of the units during training
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])
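The mechanism behind Dropout can be sketched in plain Python (this is "inverted dropout", the variant Keras uses: surviving activations are scaled by 1/(1 − rate) so the expected output is unchanged; the helper itself is hypothetical):

```python
import random

def dropout(x, rate=0.5, training=True, seed=None):
    if not training or rate == 0.0:
        return list(x)  # identity at inference time
    rng = random.Random(seed)
    keep = 1.0 - rate
    # Zero each unit with probability `rate`, scale survivors by 1/keep
    return [v / keep if rng.random() < keep else 0.0 for v in x]

print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, seed=0))  # roughly half zeroed, rest doubled
print(dropout([1.0, 2.0, 3.0, 4.0], training=False))    # unchanged at inference
```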
4.4 Early Stopping
Early stopping halts training when the validation loss stops improving, which prevents overfitting:
# Use early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',        # metric to monitor
    patience=10,               # epochs without improvement before stopping
    verbose=1,                 # print a message when training stops
    mode='min',                # 'min': stop when the monitored metric stops decreasing
    restore_best_weights=True  # restore the weights from the best epoch
)
# Train the model with early stopping
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stopping]
)
4.5 Batch Normalization
Batch normalization can speed up convergence and reduce overfitting:
# Use batch normalization
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])
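What a BatchNormalization layer computes for a single feature can be sketched in plain Python (the learnable scale and shift, gamma and beta, are omitted here, i.e. gamma = 1 and beta = 0; epsilon guards against division by zero):

```python
def batch_norm(batch, eps=1e-3):
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    # Normalize each value to roughly zero mean and unit variance
    return [(v - mean) / (var + eps) ** 0.5 for v in batch]

print(batch_norm([1.0, 2.0, 3.0, 4.0]))  # centred on 0, spread ≈ 1
```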
4.6 Model Ensembling
Ensembling combines the predictions of several models; it usually yields better performance and reduces overfitting:
# Simple model ensembling example
# Train several models
models = []
for i in range(5):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), verbose=0)
    models.append(model)

# Ensemble prediction
def ensemble_predict(models, X):
    predictions = []
    for model in models:
        predictions.append(model.predict(X))
    # Average the individual predictions
    return np.mean(predictions, axis=0)

# Predict with the ensemble
y_pred = ensemble_predict(models, X_test)
5. Training Tips
5.1 Monitoring the Training Process
Use TensorBoard to monitor training:
# Use TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir='logs',      # log directory
    histogram_freq=1,    # record histograms every epoch
    write_graph=True,    # record the computation graph
    write_images=True,   # record model weights as images
    update_freq='epoch'  # logging frequency
)
# Train the model with TensorBoard
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[tensorboard_callback]
)
Start TensorBoard:
tensorboard --logdir logs
5.2 Saving the Best Model
Use ModelCheckpoint to save the best model seen during training:
# Save the best model
model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.h5',  # where to save the model
    monitor='val_loss',        # metric to monitor
    verbose=1,                 # print a message when a checkpoint is saved
    save_best_only=True,       # keep only the best model
    mode='min',                # 'min': save when the monitored metric reaches a new minimum
    save_weights_only=False,   # save the full model, not just the weights
    save_freq='epoch'          # check once per epoch
)
# Train the model with ModelCheckpoint
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[model_checkpoint]
)
5.3 Gradient Clipping
Gradient clipping can prevent exploding gradients:
# Clip gradients by value
optimizer = tf.keras.optimizers.Adam(clipvalue=1.0)  # clip each gradient element to [-1.0, 1.0]
model.compile(optimizer=optimizer, loss='mse')
Or clip by global norm:
# Clip gradients by global norm
optimizer = tf.keras.optimizers.Adam(global_clipnorm=1.0)  # rescale all gradients so their combined norm is at most 1.0
model.compile(optimizer=optimizer, loss='mse')
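Global-norm clipping can be sketched in plain Python (a hypothetical helper mirroring the idea for a flat list of gradient values):

```python
def clip_by_global_norm(grads, clip_norm):
    global_norm = sum(g * g for g in grads) ** 0.5
    if global_norm <= clip_norm:
        return list(grads)           # already small enough: unchanged
    scale = clip_norm / global_norm  # shrink every gradient by one common factor
    return [g * scale for g in grads]

print(clip_by_global_norm([3.0, 4.0], 1.0))  # norm 5 -> rescaled to norm 1
print(clip_by_global_norm([0.3, 0.4], 1.0))  # norm 0.5 -> unchanged
```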
5.4 Mixed-Precision Training
Mixed-precision training computes with a mix of 16-bit and 32-bit floats, which can speed up training and reduce memory usage:
# Use mixed-precision training
from tensorflow.keras import mixed_precision

# Set the mixed-precision policy
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Build the model (keep the softmax output in float32 for numerical stability)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax', dtype='float32')
])
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))
6. Distributed Training
Distributed training uses multiple devices or machines to train a model, which speeds up training.
6.1 Data Parallelism
Data parallelism splits the data into batches that are processed in parallel on different devices:
# Use MirroredStrategy for distributed training
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Build the model inside the strategy scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))
7. Exercises
Exercise 1: Optimizing model training
- Build a simple neural network model.
- Try different loss functions and optimizers and compare their effects.
- Implement a learning rate decay strategy and observe its effect on training.
- Add regularization, Dropout, and batch normalization to prevent overfitting.
- Use early stopping and ModelCheckpoint to save the best model.
Exercise 2: Optimizing an image classification model
- Build an image classification model on the MNIST dataset.
- Add data augmentation to improve model performance.
- Try different model architectures, such as a CNN or ResNet.
- Use the learning rate finder to locate a good learning rate.
- Optimize the training process with early stopping and a learning rate scheduler.
Exercise 3: Preventing overfitting
- Build a complex model that overfits the training data.
- Try different techniques to prevent overfitting, such as data augmentation, regularization, and Dropout.
- Compare the effectiveness of these techniques and analyze their pros and cons.
- Use TensorBoard to monitor the training of the different models.