PyTorch Data Loading and Preprocessing
1. The Importance of Data Loading
In deep learning, data loading and preprocessing are essential parts of model training. Efficient data loading speeds up training, while appropriate preprocessing improves model performance. PyTorch provides two core components for handling data:
- torch.utils.data.Dataset: an abstract class representing a dataset, used to define where the data comes from and how it is read
- torch.utils.data.DataLoader: used to load data in batches, with support for multi-process loading, shuffling, batching, and more
2. Custom Datasets
To load custom data with PyTorch, we inherit from the Dataset class and implement the following two methods:
- __len__: returns the size of the dataset
- __getitem__: returns a data sample given an index
2.1 A Simple Example: Loading Numeric Data
Let's create a simple Dataset that loads numeric data:
import torch
from torch.utils.data import Dataset, DataLoader

# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Fetch the data and its label
        x = self.data[idx]
        y = self.targets[idx]
        # Convert to tensors
        x = torch.tensor(x, dtype=torch.float32)
        y = torch.tensor(y, dtype=torch.float32)
        return x, y

# Create example data
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
targets = [3.0, 7.0, 11.0, 15.0]  # y = x1 + x2

# Create a dataset instance
dataset = CustomDataset(data, targets)

# Inspect the dataset
for i in range(len(dataset)):
    x, y = dataset[i]
    print(f"Sample {i}: x={x}, y={y}")
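Once a Dataset is defined, it can also be split into training and validation subsets, a common next step before building DataLoaders. A minimal sketch using torch.utils.data.random_split (the 3/1 split below is an arbitrary choice for illustration):

```python
import torch
from torch.utils.data import Dataset, random_split

class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = torch.tensor(self.data[idx], dtype=torch.float32)
        y = torch.tensor(self.targets[idx], dtype=torch.float32)
        return x, y

dataset = CustomDataset([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
                        [3.0, 7.0, 11.0, 15.0])

# Randomly split the 4 samples into a 3-sample training set
# and a 1-sample validation set
train_set, val_set = random_split(dataset, [3, 1])
print(len(train_set), len(val_set))  # 3 1
```

Both returned objects are themselves Datasets, so each can be wrapped in its own DataLoader.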
2.2 Example: Loading Image Data
For image data, we can use the PIL library or OpenCV to read images and then preprocess them:
import os
from PIL import Image
from torchvision import transforms

# Custom image Dataset class
class ImageDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        self.image_dir = image_dir
        self.image_files = [f for f in os.listdir(image_dir)
                            if f.endswith('.jpg') or f.endswith('.png')]
        self.transform = transform

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Read the image
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_path).convert('RGB')
        # Apply the transform
        if self.transform:
            image = self.transform(image)
        # Here we assume the filename encodes the label,
        # e.g. "cat_001.jpg" is a cat image
        label = 0 if 'cat' in self.image_files[idx] else 1
        return image, label
3. Batch Loading with DataLoader
The DataLoader class loads data in batches and supports the following features:
- batch loading (batch_size)
- data shuffling (shuffle)
- multi-process loading (num_workers)
- custom sampling (sampler)
- custom batch assembly (collate_fn)
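As an illustration of the sampler option, here is a minimal sketch using torch.utils.data.WeightedRandomSampler to over-sample a rare class; the toy dataset and weights are made up for demonstration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# A toy imbalanced dataset: 9 samples of class 0, 1 sample of class 1
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.tensor([0] * 9 + [1])
dataset = TensorDataset(features, labels)

# Weight each sample inversely to its class frequency,
# so the rare class is drawn far more often than its raw share
class_counts = torch.bincount(labels)         # tensor([9, 1])
weights = 1.0 / class_counts[labels].float()  # 1/9 for class 0, 1.0 for class 1

sampler = WeightedRandomSampler(weights, num_samples=len(dataset),
                                replacement=True)

# sampler and shuffle are mutually exclusive, so shuffle is omitted here
loader = DataLoader(dataset, batch_size=5, sampler=sampler)

for x, y in loader:
    print(x.shape, y.shape)  # torch.Size([5, 1]) torch.Size([5])
```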
3.1 Basic Usage
Let's use the CustomDataset created earlier to demonstrate the basics of DataLoader:
# Create a DataLoader
dataloader = DataLoader(
    dataset,        # the dataset
    batch_size=2,   # batch size
    shuffle=True,   # whether to shuffle the data
    num_workers=0   # number of worker processes; 0 means the main process
)

# Iterate over the DataLoader
for batch_idx, (batch_x, batch_y) in enumerate(dataloader):
    print(f"Batch {batch_idx+1}: x={batch_x}, y={batch_y}")
    print(f"Batch {batch_idx+1} shapes: x={batch_x.shape}, y={batch_y.shape}")
3.2 Multi-Process Loading
For large datasets, we can use multiple worker processes to improve loading efficiency:
# Create a DataLoader with multiple worker processes
dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    num_workers=4  # use 4 worker processes
)
Tip
On Windows, multi-process data loading can run into issues. If you see errors, try setting num_workers to 0, or place the code that creates the DataLoader inside an if __name__ == '__main__': block.
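The guard mentioned in the tip can be sketched as follows. On Windows, worker processes re-import the main module, so any code that spawns workers must live under the guard; the toy dataset below is made up for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(8, 2), torch.randn(8))
    # num_workers > 0 spawns worker processes -- this is what needs the guard
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2)
    return sum(1 for _ in loader)  # number of batches

if __name__ == '__main__':
    # Only the original process runs this; re-imported workers skip it
    print(main())  # 4 batches of 2 samples each
```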
4. Data Preprocessing
Data preprocessing is a key step in deep learning; it improves both model performance and convergence speed. PyTorch's torchvision.transforms module provides a rich set of preprocessing methods.
4.1 Common Image Transforms
For image data, common preprocessing operations include:
- resizing (Resize)
- cropping (RandomCrop, CenterCrop)
- flipping (RandomHorizontalFlip, RandomVerticalFlip)
- rotation (RandomRotation)
- normalization (Normalize)
- conversion to tensors (ToTensor)
4.2 Composing Transforms
We can use transforms.Compose to chain multiple transform operations:
from torchvision import transforms

# Define a transform pipeline
transform = transforms.Compose([
    transforms.Resize((224, 224)),      # resize the image to 224x224
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.ToTensor(),              # convert to a tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])  # normalize
])

# Apply the transform pipeline to a Dataset
image_dataset = ImageDataset(image_dir='./images', transform=transform)
4.3 Custom Transforms
Besides the predefined transforms, we can also define our own:
# Custom transform class
class CustomTransform:
    def __init__(self, param):
        self.param = param

    def __call__(self, sample):
        # Implement your transform logic here;
        # sample may be a PIL image, a tensor, etc.
        transformed_sample = sample  # placeholder: replace with real logic
        return transformed_sample

# Use the custom transform
transform = transforms.Compose([
    CustomTransform(param=0.5),
    transforms.ToTensor()
])
5. Using Built-in Datasets
PyTorch's torchvision.datasets module provides a number of common datasets, such as MNIST, CIFAR10, and ImageNet. We can use these datasets directly without downloading and processing the data by hand.
5.1 Loading the MNIST Dataset
Let's load the MNIST handwritten-digit dataset:
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# Load the training set
train_dataset = MNIST(
    root='./data',        # where to store the data
    train=True,           # training split
    transform=ToTensor(), # transform to apply
    download=True         # download if the data is not already present
)

# Load the test set
test_dataset = MNIST(
    root='./data',
    train=False,
    transform=ToTensor(),
    download=True
)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Iterate over the DataLoader
for images, labels in train_loader:
    print(f"Image batch shape: {images.shape}")
    print(f"Label batch shape: {labels.shape}")
    break
5.2 Loading the CIFAR10 Dataset
CIFAR10 is an image dataset with 10 classes:
from torchvision.datasets import CIFAR10

# Define the transforms
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Load the CIFAR10 dataset
train_dataset = CIFAR10(
    root='./data',
    train=True,
    transform=transform,
    download=True
)
test_dataset = CIFAR10(
    root='./data',
    train=False,
    transform=transform,
    download=True
)

# Inspect the class names
classes = train_dataset.classes
print(f"CIFAR10 classes: {classes}")
6. Data Augmentation
Data augmentation is a common preprocessing technique that applies random transforms to the training data to increase its diversity, which in turn improves the model's ability to generalize. PyTorch provides a rich set of augmentation methods.
6.1 Common Augmentation Techniques
- random cropping
- random flipping
- random rotation
- random brightness and contrast adjustment
- random noise
- color jitter
6.2 Applying Data Augmentation
Let's add data augmentation for the CIFAR10 dataset:
# Define a transform pipeline that includes augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32),   # random crop and resize
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),  # random color adjustment
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# The test set does not use augmentation
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Load the datasets
train_dataset = CIFAR10(root='./data', train=True,
                        transform=train_transform, download=True)
test_dataset = CIFAR10(root='./data', train=False,
                       transform=test_transform, download=True)
7. Custom Collate Functions
When samples differ in size (for example, variable-length sequences), the default collate_fn cannot stack them into a batch. In that case we need a custom collate_fn to assemble the batch ourselves.
7.1 Handling Variable-Length Sequences
Let's create a collate_fn that handles variable-length sequences:
import torch.nn.functional as F

def custom_collate_fn(batch):
    """Custom collate_fn for variable-length sequences."""
    # Separate data and labels
    data = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    # Find the maximum sequence length
    max_len = max(len(seq) for seq in data)
    # Pad the sequences so that they all have the same length
    padded_data = []
    for seq in data:
        # Pad with zeros
        padded_seq = F.pad(seq, (0, max_len - len(seq)))
        padded_data.append(padded_seq)
    # Convert to tensors
    padded_data = torch.stack(padded_data)
    labels = torch.tensor(labels)
    return padded_data, labels

# Use the custom collate_fn
dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=custom_collate_fn
)
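PyTorch also ships a helper for exactly this padding step. An equivalent sketch using torch.nn.utils.rnn.pad_sequence, with a made-up two-sample batch to show the result:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate_fn(batch):
    """Collate variable-length 1-D sequences by padding with zeros."""
    data = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    # pad_sequence pads every sequence to the longest one in the batch;
    # batch_first=True yields shape (batch, max_len)
    padded = pad_sequence(data, batch_first=True, padding_value=0.0)
    return padded, torch.tensor(labels)

batch = [(torch.tensor([1.0, 2.0, 3.0]), 0),
         (torch.tensor([4.0]), 1)]
padded, labels = pad_collate_fn(batch)
print(padded)  # tensor([[1., 2., 3.], [4., 0., 0.]])
print(labels)  # tensor([0, 1])
```

This function can be passed to DataLoader via collate_fn in exactly the same way as the hand-written version above.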
Exercises
Exercise 1: Custom Dataset
Create a custom Dataset class that loads data in CSV format. Assume the CSV file contains several feature columns and one label column; use Pandas to read the file, then convert the data to tensors.
Exercise 2: Data Augmentation
Design a set of augmentation transforms for the CIFAR10 dataset, including random cropping, random flipping, and color adjustment, then load the data with a DataLoader and visualize the augmented images.
Exercise 3: Handling Variable-Length Sequences
Create a dataset of variable-length sequences, then implement a custom collate_fn to handle them, and finally load the data with a DataLoader.
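As a starting point for Exercise 1, here is a minimal sketch of a CSV-backed Dataset. The column names feature1, feature2, and label, and the generated file sample.csv, are assumptions made only to keep the example self-contained:

```python
import torch
import pandas as pd
from torch.utils.data import Dataset

class CSVDataset(Dataset):
    def __init__(self, csv_path, label_column):
        df = pd.read_csv(csv_path)
        # Every column except the label column is treated as a feature
        self.features = torch.tensor(df.drop(columns=[label_column]).values,
                                     dtype=torch.float32)
        self.labels = torch.tensor(df[label_column].values,
                                   dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Generate a tiny CSV file so the example runs on its own
pd.DataFrame({'feature1': [1.0, 3.0],
              'feature2': [2.0, 4.0],
              'label': [3.0, 7.0]}).to_csv('sample.csv', index=False)

dataset = CSVDataset('sample.csv', label_column='label')
x, y = dataset[0]
print(x, y)  # tensor([1., 2.]) tensor(3.)
```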
8. Summary
This tutorial introduced PyTorch's data loading and preprocessing techniques, including:
- defining custom datasets with the Dataset class
- batch loading with DataLoader
- data preprocessing and transforms
- using built-in datasets
- data augmentation techniques
- custom collate_fn functions for variable-length sequences
Efficient data loading and appropriate preprocessing are essential foundations for training deep learning models; mastering these techniques improves both training efficiency and model performance.