1. Automatic Differentiation Concepts
Automatic differentiation (AD) is a technique for computing the derivatives of functions; it can compute the gradients of complex functions accurately and efficiently. In deep learning, automatic differentiation is the core technique behind model training, because it lets us compute the gradient of the loss function with respect to the model parameters and then optimize the model with gradient descent.
Compared with numerical differentiation and symbolic differentiation, automatic differentiation has the following advantages:
- High accuracy: numerical differentiation suffers from truncation error, while automatic differentiation yields exact gradient values.
- High efficiency: symbolic differentiation can produce unwieldy expressions, while automatic differentiation computes gradients efficiently.
- Broad applicability: automatic differentiation can handle all kinds of complex functions, including conditional branches, loops, and recursion.
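To illustrate the accuracy point, the following sketch compares a central-difference numerical gradient against the exact gradient from tf.GradientTape. The test function and step size are arbitrary choices for demonstration:

```python
import tensorflow as tf

def f(x):
    return x ** 3  # exact derivative: 3 * x**2

x = tf.Variable(2.0)

# Automatic differentiation: exact to floating-point precision
with tf.GradientTape() as tape:
    y = f(x)
exact = tape.gradient(y, x)  # 3 * 2**2 = 12.0

# Numerical differentiation: central difference, carries truncation error
h = 1e-3
numeric = (f(x + h) - f(x - h)) / (2 * h)

print("autodiff:", exact.numpy())   # 12.0
print("numeric :", float(numeric))  # close to 12.0, but only approximately
```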
2. Automatic Differentiation in TensorFlow
TensorFlow 2.x provides the tf.GradientTape API for automatic differentiation. tf.GradientTape is a tape-based automatic differentiation tool: it records the forward computation and then replays it in reverse to compute gradients.
2.1 Basic Usage of tf.GradientTape
The basic steps for using tf.GradientTape are:
- Create a tf.GradientTape object
- Execute the forward computation inside the tape context
- Use the tape to compute gradients
import tensorflow as tf

# Create a variable
tx = tf.Variable(3.0)

# Use tf.GradientTape to record the computation
with tf.GradientTape() as tape:
    y = tx ** 2

# Compute the gradient of y with respect to tx
grad = tape.gradient(y, tx)
print("y = tx ** 2, tx = 3.0")
print("dy/dtx =", grad.numpy())  # Output: dy/dtx = 6.0
2.2 How Gradient Computation Works
When we execute operations inside a tf.GradientTape context, TensorFlow records the gradient function of every operation. These gradient functions are added to a computation graph; when tape.gradient() is called, TensorFlow propagates backward along this graph and computes the gradient of each variable.
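As a sketch of this reverse traversal, chaining two operations combines their recorded local gradients via the chain rule. The function here is chosen purely for illustration:

```python
import math
import tensorflow as tf

x = tf.Variable(1.0)

with tf.GradientTape() as tape:
    u = x ** 2     # recorded op 1: du/dx = 2 * x
    y = tf.sin(u)  # recorded op 2: dy/du = cos(u)

# The reverse pass multiplies the recorded local gradients (chain rule):
# dy/dx = cos(x ** 2) * 2 * x
grad = tape.gradient(y, x)
print(grad.numpy())        # matches 2 * cos(1.0) ≈ 1.0806
print(2 * math.cos(1.0))
```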
Tip
By default, tf.GradientTape only records operations on variables (tf.Variable). To record operations on constants or ordinary tensors, use the tape.watch() method.
2.3 Recording Gradients for Constants
To compute the gradient of a constant, use the tape.watch() method:
# Compute a gradient with respect to a constant
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    # Watch the constant x
    tape.watch(x)
    y = x ** 2
grad = tape.gradient(y, x)
print("y = x ** 2, x = 3.0")
print("dy/dx =", grad.numpy())  # Output: dy/dx = 6.0
3. Advanced Usage of tf.GradientTape
3.1 Persistent Tapes
By default, tf.GradientTape allows only one call to gradient(), after which its resources are released automatically. If you need to compute gradients multiple times, create a persistent tape:
x = tf.Variable(3.0)

# Create a persistent tape
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2
    z = y ** 2

# Gradient of z with respect to y
dz_dy = tape.gradient(z, y)  # dz/dy = 2 * y = 18.0
print("dz/dy =", dz_dy.numpy())

# Gradient of z with respect to x
dz_dx = tape.gradient(z, x)  # dz/dx = 2 * y * 2 * x = 108.0
print("dz/dx =", dz_dx.numpy())

# Release resources manually
del tape
3.2 Higher-Order Derivatives
Nesting tf.GradientTape contexts lets you compute higher-order derivatives:
x = tf.Variable(3.0)

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        y = x ** 3
    # First derivative (computed inside the outer tape so it is recorded)
    dy_dx = inner_tape.gradient(y, x)
# Second derivative
d2y_dx2 = outer_tape.gradient(dy_dx, x)

print("y = x ** 3, x = 3.0")
print("First derivative dy/dx =", dy_dx.numpy())       # Output: dy/dx = 27.0
print("Second derivative d²y/dx² =", d2y_dx2.numpy())  # Output: d²y/dx² = 18.0
3.3 Computing Gradients for Multiple Variables
tf.GradientTape can compute gradients for several variables at once:

w = tf.Variable(3.0)
b = tf.Variable(1.0)
x = tf.constant(2.0)

with tf.GradientTape() as tape:
    y = w * x + b

# Compute the gradients of y with respect to w and b at the same time
dy_dw, dy_db = tape.gradient(y, [w, b])
print("y = w * x + b, w=3.0, b=1.0, x=2.0")
print("dy/dw =", dy_dw.numpy())  # Output: dy/dw = 2.0
print("dy/db =", dy_db.numpy())  # Output: dy/db = 1.0
4. Training a Model with Automatic Differentiation
The main application of automatic differentiation is training deep learning models. Below we use tf.GradientTape to train a simple linear regression model.
4.1 Preparing the Data
# Generate mock data
X = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0])
y = tf.constant([3.0, 5.0, 7.0, 9.0, 11.0])

# Our model is y = w * X + b; we need to learn w and b
w = tf.Variable(initial_value=0.0)
b = tf.Variable(initial_value=0.0)
4.2 Defining the Loss Function
We use mean squared error (MSE) as the loss function:
def compute_loss(y_pred, y_true):
    return tf.reduce_mean(tf.square(y_pred - y_true))
4.3 Training the Model

# Training hyperparameters
epochs = 1000
learning_rate = 0.01

# Training loop
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # Forward pass
        y_pred = w * X + b
        # Compute the loss
        loss = compute_loss(y_pred, y)
    # Compute the gradients
    dw, db = tape.gradient(loss, [w, b])
    # Update the parameters
    w.assign_sub(learning_rate * dw)
    b.assign_sub(learning_rate * db)
    # Print the loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}: loss = {loss.numpy():.4f}, w = {w.numpy():.4f}, b = {b.numpy():.4f}")

# Print the final result
print(f"\nFinal result: w = {w.numpy():.4f}, b = {b.numpy():.4f}")
print(f"Prediction: w * X + b = {w.numpy()} * X + {b.numpy()}")
The output should look something like this:

Epoch 100: loss = 0.0018, w = 2.0236, b = 0.9313
Epoch 200: loss = 0.0007, w = 2.0135, b = 0.9605
Epoch 300: loss = 0.0003, w = 2.0077, b = 0.9776
Epoch 400: loss = 0.0001, w = 2.0044, b = 0.9873
Epoch 500: loss = 0.0000, w = 2.0025, b = 0.9928
Epoch 600: loss = 0.0000, w = 2.0014, b = 0.9959
Epoch 700: loss = 0.0000, w = 2.0008, b = 0.9977
Epoch 800: loss = 0.0000, w = 2.0005, b = 0.9987
Epoch 900: loss = 0.0000, w = 2.0003, b = 0.9992
Epoch 1000: loss = 0.0000, w = 2.0001, b = 0.9996

Final result: w = 2.0001, b = 0.9996
Prediction: w * X + b = 2.0001 * X + 0.9996
As you can see, the model successfully learned w ≈ 2 and b ≈ 1, very close to the true model y = 2x + 1.
5. How Gradient Descent Works
Gradient descent is an optimization algorithm for finding the minimum of a function. In deep learning, we use gradient descent to minimize the loss function and thereby find the optimal model parameters.
5.1 The Mathematics of Gradient Descent
The basic idea of gradient descent is: to find the minimum of a function, we should move in the direction opposite to its gradient. Since the gradient points in the direction of steepest ascent, the negative gradient points in the direction of steepest descent.
For a function f(x), the gradient descent update rule is:
x_new = x_old - learning_rate * ∇f(x_old)
where:
- x_old is the current parameter value
- learning_rate is the learning rate, which controls the step size of each update
- ∇f(x_old) is the gradient of the function at x_old
- x_new is the updated parameter value
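A minimal numeric sketch of this update rule, minimizing the illustrative function f(x) = (x - 5)² whose gradient is 2(x - 5); the starting point, learning rate, and step count are arbitrary choices:

```python
# Minimize f(x) = (x - 5)**2; its gradient is 2 * (x - 5), minimum at x = 5
def grad_f(x):
    return 2 * (x - 5)

x = 0.0  # x_old: initial parameter value
learning_rate = 0.1

for _ in range(100):
    # x_new = x_old - learning_rate * grad f(x_old)
    x = x - learning_rate * grad_f(x)

print(x)  # converges to ~5.0
```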
5.2 Choosing a Learning Rate
The learning rate is an important hyperparameter of gradient descent; it determines the step size of each parameter update. The choice of learning rate affects how well the model trains:
- Learning rate too small: training is slow and needs many more iterations.
- Learning rate too large: training may become unstable or even diverge.
- A suitable learning rate: converges to a good solution in relatively few iterations.
Tip
In practice, a learning rate decay schedule is often used, gradually reducing the learning rate over the course of training. This allows fast convergence early on and fine-grained parameter adjustments later.
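One way to sketch such a decay schedule is with the Keras ExponentialDecay schedule; the hyperparameter values below are placeholders, not recommendations:

```python
import tensorflow as tf

# Exponential decay: lr(step) = initial_lr * decay_rate ** (step / decay_steps)
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    decay_rate=0.96,
)

print(float(schedule(0)))     # 0.1 at the start of training
print(float(schedule(1000)))  # smaller later: 0.1 * 0.96**10 ≈ 0.0665
```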
6. Applications of Automatic Differentiation
Automatic differentiation is used widely in deep learning, mainly for:
6.1 Model Training
Automatic differentiation is the core technique behind training deep learning models: it lets us efficiently compute the gradient of the loss function with respect to the model parameters, and then optimize the model with gradient descent.
6.2 Gradient Checking
Automatic differentiation can be used for gradient checking: verifying that manually derived gradients are correct. When implementing a custom loss function or layer, gradient checking is an important debugging tool.
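A hedged sketch of such a check, comparing the tape gradient of a made-up custom loss against central finite differences (the loss function here is purely illustrative):

```python
import tensorflow as tf

def custom_loss(w):
    # Illustrative custom loss; substitute the loss under test
    return tf.reduce_sum(w ** 2 + tf.sin(w))

w = tf.Variable([0.5, -1.0])

# Analytic gradient from the tape (here: 2 * w + cos(w))
with tf.GradientTape() as tape:
    loss = custom_loss(w)
analytic = tape.gradient(loss, w)

# Finite-difference gradient, one component at a time
h = 1e-3
numeric = []
for i in range(2):
    e = tf.one_hot(i, 2) * h
    numeric.append(float((custom_loss(w + e) - custom_loss(w - e)) / (2 * h)))

# The two should agree closely if the analytic gradient is correct
print(analytic.numpy(), numeric)
```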
6.3 Interpretability Analysis
Automatic differentiation can be used to compute the importance of input features, helping us understand the model's decision process and improving interpretability.
6.4 Optimization Problems
Automatic differentiation can be used to solve all kinds of optimization problems, not just deep learning model training. For example, it has important applications in reinforcement learning, recommender systems, natural language processing, and other fields.
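As an example beyond model training, this sketch minimizes an ordinary two-dimensional function (chosen for illustration, with its minimum at (1, -2)) using tf.GradientTape and plain gradient descent:

```python
import tensorflow as tf

# Minimize f(v) = (v0 - 1)**2 + (v1 + 2)**2; the minimum is at (1, -2)
v = tf.Variable([0.0, 0.0])
learning_rate = 0.1

for _ in range(200):
    with tf.GradientTape() as tape:
        f = (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
    grad = tape.gradient(f, v)
    v.assign_sub(learning_rate * grad)

print(v.numpy())  # converges to approximately [1.0, -2.0]
```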
7. Common Problems and Solutions
7.1 Vanishing and Exploding Gradients
In deep neural networks, gradients can gradually vanish or explode as the number of layers grows, making the model difficult to train.
Solutions:
- Use suitable activation functions, such as ReLU and its variants
- Use Batch Normalization
- Use residual connections (Residual Connections)
- Use gradient clipping (Gradient Clipping)
- Use suitable weight initialization methods
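An illustrative sketch of the vanishing-gradient effect: chaining many sigmoid activations shrinks the gradient toward zero (each sigmoid's local gradient is at most 0.25), while a ReLU chain keeps the gradient intact for this positive input. The chain depth and input value are arbitrary choices:

```python
import tensorflow as tf

x = tf.Variable(1.0)

# 20 stacked sigmoids: the gradient is a product of 20 factors, each <= 0.25
with tf.GradientTape() as tape:
    y = x
    for _ in range(20):
        y = tf.sigmoid(y)
sigmoid_grad = tape.gradient(y, x)

# 20 stacked ReLUs: the local gradient is 1 for positive inputs
with tf.GradientTape() as tape:
    z = x
    for _ in range(20):
        z = tf.nn.relu(z)
relu_grad = tape.gradient(z, x)

print("sigmoid chain gradient:", sigmoid_grad.numpy())  # vanishingly small
print("relu chain gradient:   ", relu_grad.numpy())     # 1.0
```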
7.2 Gradient Clipping
Gradient clipping is a technique for preventing exploding gradients: it constrains the norm of the gradients so they never become too large:

# Gradient clipping example (assumes model, optimizer, X, and y are already defined)
learning_rate = 0.01
max_gradient_norm = 1.0

with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = compute_loss(y_pred, y)
gradients = tape.gradient(loss, model.trainable_variables)

# Clip the gradients
gradients, _ = tf.clip_by_global_norm(gradients, max_gradient_norm)

# Update the parameters
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
7.3 Computation Graph Issues
TensorFlow 2.x uses eager execution (Eager Execution) by default, but in some situations we may need computation graphs to improve performance.
Solutions:
- Use the @tf.function decorator to convert a function into a computation graph
- Use tf.TensorArray to handle variable-length sequences
- Avoid Python control flow inside computation graphs; use TensorFlow's control-flow operations such as tf.cond() and tf.while_loop() instead
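A minimal sketch of the @tf.function approach (the function body is illustrative):

```python
import tensorflow as tf

@tf.function  # traces the Python function into a reusable computation graph
def squared_norm(x):
    return tf.reduce_sum(x * x)

v = tf.constant([1.0, 2.0, 3.0])
print(float(squared_norm(v)))  # 14.0, now executed as a graph
```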
8. Exercises
Exercise 1: Computing Higher-Order Derivatives
- Define the function y = x^4 + 2x^3 - 3x^2 + 4x - 5
- Use tf.GradientTape to compute the first, second, and third derivatives
- Evaluate these derivatives at x = 2.0
Exercise 2: Training a Linear Regression Model
- Generate 100 mock data points using the model y = 3x + 2 + noise
- Train a linear regression model with tf.GradientTape
- Try different learning rates (0.001, 0.01, 0.1) and observe the training behavior
- Plot the loss curve over time
Exercise 3: Implementing a Gradient Descent Optimizer
- Implement a simple gradient descent optimizer class
- The optimizer should support:
  - Basic gradient descent updates
  - Momentum
  - Learning rate decay
- Use the optimizer to train a linear regression model
- Compare the effectiveness of the different optimization strategies