PyTorch Data Loading and Preprocessing

Learn to load and preprocess data with PyTorch's Dataset and DataLoader, including batching, shuffling, and transforms, to make model training more efficient.

1. The Importance of Data Loading

In deep learning, data loading and preprocessing are essential parts of model training. Efficient data loading speeds up training, while appropriate preprocessing improves model performance. PyTorch provides two core components for working with data:

  • torch.utils.data.Dataset: an abstract class representing a dataset, defining where the data comes from and how to read it
  • torch.utils.data.DataLoader: loads data in batches, with support for parallel worker processes, shuffling, batching, and more

2. Custom Datasets

To load custom data with PyTorch, inherit from the Dataset class and implement the following two methods:

  • __len__: returns the size of the dataset
  • __getitem__: returns the data sample for a given index

2.1 A Simple Example: Loading Numeric Data

Let's create a simple Dataset that loads numeric data:

import torch
from torch.utils.data import Dataset, DataLoader

# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # Fetch the data point and its label
        x = self.data[idx]
        y = self.targets[idx]
        
        # Convert to tensors
        x = torch.tensor(x, dtype=torch.float32)
        y = torch.tensor(y, dtype=torch.float32)
        
        return x, y

# Create example data
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
targets = [3.0, 7.0, 11.0, 15.0]  # y = 2*x1 + x2

# Create a dataset instance
dataset = CustomDataset(data, targets)

# Test the dataset
for i in range(len(dataset)):
    x, y = dataset[i]
    print(f"Sample {i}: x={x}, y={y}")

2.2 Example: Loading Image Data

For image data, we can read images with the PIL library (or OpenCV) and then preprocess them:

import os
from PIL import Image
from torchvision import transforms

# Custom image Dataset class
class ImageDataset(Dataset):
    def __init__(self, image_dir, transform=None):
        self.image_dir = image_dir
        self.image_files = [f for f in os.listdir(image_dir) if f.endswith('.jpg') or f.endswith('.png')]
        self.transform = transform
    
    def __len__(self):
        return len(self.image_files)
    
    def __getitem__(self, idx):
        # Read the image
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = Image.open(img_path).convert('RGB')
        
        # Apply the transform
        if self.transform:
            image = self.transform(image)
        
        # Assume the filename encodes the label, e.g. "cat_001.jpg" is a cat image
        label = 0 if 'cat' in self.image_files[idx] else 1
        
        return image, label

3. Batch Loading with DataLoader

The DataLoader class loads data in batches and supports the following features:

  • Batching (batch_size)
  • Shuffling (shuffle)
  • Parallel loading with worker processes (num_workers)
  • Custom sampling (sampler)
  • Custom batch collation (collate_fn)
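
Of these, sampler is the only option not demonstrated elsewhere in this tutorial. Here is a minimal sketch using WeightedRandomSampler to oversample a minority class; the imbalanced dataset below is hypothetical, made up just for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Hypothetical imbalanced dataset: 8 samples of class 0, 2 of class 1
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.tensor([0] * 8 + [1] * 2)

# Weight each sample inversely to its class frequency, so the minority
# class is drawn about as often as the majority class
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)

# sampler and shuffle are mutually exclusive: pass one or the other
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=5, sampler=sampler)

for batch_x, batch_y in loader:
    print(batch_y)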

3.1 Basic Usage

Let's use the CustomDataset created earlier to demonstrate basic DataLoader usage:

# Create a DataLoader
dataloader = DataLoader(
    dataset,              # the dataset
    batch_size=2,         # batch size
    shuffle=True,         # whether to shuffle the data
    num_workers=0         # number of worker processes; 0 means load in the main process
)

# Iterate over the DataLoader
for batch_idx, (batch_x, batch_y) in enumerate(dataloader):
    print(f"Batch {batch_idx+1}: x={batch_x}, y={batch_y}")
    print(f"Batch {batch_idx+1} shapes: x={batch_x.shape}, y={batch_y.shape}")

3.2 Loading with Multiple Workers

For large datasets, we can use multiple worker processes to speed up loading:

# Create a DataLoader with multiple worker processes
dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    num_workers=4  # use 4 worker processes
)

Tip

On Windows, loading data with multiple workers can run into problems. If you see an error, try setting num_workers to 0, or wrap your code in an if __name__ == '__main__': block.
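
A minimal sketch of that guard (the dataset here is a random stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Stand-in dataset: 8 samples with 2 features each
    dataset = TensorDataset(torch.randn(8, 2), torch.randn(8))
    loader = DataLoader(dataset, batch_size=2, num_workers=2)
    for x, y in loader:
        print(x.shape, y.shape)

# On Windows, worker processes are started with "spawn", which
# re-imports this script; the guard keeps the DataLoader code from
# running again inside each worker process.
if __name__ == '__main__':
    main()
```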

4. Data Preprocessing

Data preprocessing is a key step in deep learning: it can improve both model performance and convergence speed. PyTorch's torchvision.transforms module provides a rich set of preprocessing methods.

4.1 Common Image Transforms

Common preprocessing operations for image data include:

  • Resizing (Resize)
  • Cropping (RandomCrop, CenterCrop)
  • Flipping (RandomHorizontalFlip, RandomVerticalFlip)
  • Rotation (RandomRotation)
  • Normalization (Normalize)
  • Conversion to tensors (ToTensor)

4.2 Composing Transforms

We can use transforms.Compose to chain multiple transforms:

from torchvision import transforms

# Define a transform pipeline
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # resize images to 224x224
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.ToTensor(),  # convert to a tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # normalize with ImageNet statistics
])

# Apply the transform to a Dataset
image_dataset = ImageDataset(image_dir='./images', transform=transform)

4.3 Custom Transforms

Besides the predefined transforms, we can also define our own:

# Custom transform class
class CustomTransform:
    def __init__(self, param):
        self.param = param
    
    def __call__(self, sample):
        # Implement the transform logic here; sample can be an image,
        # a tensor, etc. As a simple example, scale a tensor:
        return sample * self.param

# Use the custom transform (placed after ToTensor, so sample is a tensor)
transform = transforms.Compose([
    transforms.ToTensor(),
    CustomTransform(param=0.5)
])

5. Using Built-in Datasets

PyTorch's torchvision.datasets module provides common datasets such as MNIST, CIFAR10, and ImageNet. We can use them directly without downloading and processing the data by hand.

5.1 Loading the MNIST Dataset

Let's load the MNIST handwritten digit dataset:

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# Load the training set
train_dataset = MNIST(
    root='./data',        # where to store the data
    train=True,           # training split
    transform=ToTensor(), # transform
    download=True         # download the data if it is not present
)

# Load the test set
test_dataset = MNIST(
    root='./data',
    train=False,
    transform=ToTensor(),
    download=True
)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Iterate over the DataLoader
for images, labels in train_loader:
    print(f"Image batch shape: {images.shape}")
    print(f"Label batch shape: {labels.shape}")
    break

5.2 Loading the CIFAR10 Dataset

CIFAR10 is an image dataset with 10 classes:

from torchvision.datasets import CIFAR10

# Define the transform
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Load the CIFAR10 dataset
train_dataset = CIFAR10(
    root='./data',
    train=True,
    transform=transform,
    download=True
)

test_dataset = CIFAR10(
    root='./data',
    train=False,
    transform=transform,
    download=True
)

# Inspect the classes
classes = train_dataset.classes
print(f"CIFAR10 classes: {classes}")

6. Data Augmentation

Data augmentation is a common preprocessing technique that applies random transforms to the training data to increase its diversity, which in turn improves model generalization. PyTorch provides many augmentation methods.

6.1 Common Augmentation Techniques

  • Random cropping
  • Random flipping
  • Random rotation
  • Random brightness and contrast adjustment
  • Random noise
  • Color jitter
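
Most of these have built-in transforms in torchvision, but random noise does not. A minimal sketch of a custom noise transform (AddGaussianNoise is our own name, not a torchvision API):

```python
import torch
from torchvision import transforms

# Custom transform that adds Gaussian noise to a tensor image
class AddGaussianNoise:
    def __init__(self, std=0.05):
        self.std = std
    
    def __call__(self, tensor):
        # randn_like draws noise with the same shape as the input
        return tensor + torch.randn_like(tensor) * self.std

# Apply after ToTensor, since the noise is added to a tensor
noise_transform = transforms.Compose([
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05)
])
```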

6.2 Applying Data Augmentation

Let's add data augmentation for the CIFAR10 dataset:

# Transform pipeline with data augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32),  # random crop, then resize
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # random color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# No augmentation for the test set
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Load the datasets
train_dataset = CIFAR10(root='./data', train=True, transform=train_transform, download=True)
test_dataset = CIFAR10(root='./data', train=False, transform=test_transform, download=True)

7. Custom Collate Functions

When samples differ in size (e.g. variable-length sequences), the default collate_fn cannot batch them. In that case we need a custom collate_fn to control how samples are combined into a batch.

7.1 Handling Variable-Length Sequences

Let's create a collate_fn that handles variable-length sequences:

import torch.nn.functional as F

def custom_collate_fn(batch):
    """Custom collate_fn for variable-length sequences."""
    # Separate the data and the labels
    data = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    
    # Find the maximum sequence length in the batch
    max_len = max(len(seq) for seq in data)
    
    # Pad every sequence to the same length
    padded_data = []
    for seq in data:
        # pad with zeros on the right
        padded_seq = F.pad(seq, (0, max_len - len(seq)))
        padded_data.append(padded_seq)
    
    # Stack into tensors
    padded_data = torch.stack(padded_data)
    labels = torch.tensor(labels)
    
    return padded_data, labels

# Use the custom collate_fn
dataloader = DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=custom_collate_fn
)
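
To see the collate function in action, here is a small self-contained sketch with a hypothetical variable-length sequence dataset (SequenceDataset and its sample data are illustrative, and the padding logic from above is repeated so the snippet runs on its own):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

def custom_collate_fn(batch):
    # Same padding logic as defined above
    data = [item[0] for item in batch]
    labels = [item[1] for item in batch]
    max_len = max(len(seq) for seq in data)
    padded_data = torch.stack([F.pad(seq, (0, max_len - len(seq))) for seq in data])
    return padded_data, torch.tensor(labels)

# Hypothetical dataset of variable-length 1-D sequences
class SequenceDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

sequences = [torch.tensor([1.0, 2.0]),
             torch.tensor([3.0, 4.0, 5.0, 6.0]),
             torch.tensor([7.0])]
labels = [0, 1, 0]

loader = DataLoader(SequenceDataset(sequences, labels), batch_size=3,
                    collate_fn=custom_collate_fn)

# Every sequence in the batch is padded to the longest one (length 4)
batch_x, batch_y = next(iter(loader))
print(batch_x.shape)
```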

Practice Exercises

Exercise 1: Custom Dataset

Create a custom Dataset class that loads data from a CSV file. Assume the CSV file contains several feature columns and one label column; read it with pandas, then convert the data to tensors.

Exercise 2: Data Augmentation

Design a set of augmentation transforms for the CIFAR10 dataset, including random cropping, random flipping, and color jitter, then load the data with a DataLoader and visualize the augmented images.

Exercise 3: Variable-Length Sequences

Create a dataset of variable-length sequences, implement a custom collate_fn to handle them, and load the data with a DataLoader.

8. Summary

This tutorial covered PyTorch's data loading and preprocessing techniques, including:

  • Defining custom datasets with the Dataset class
  • Loading data in batches with DataLoader
  • Data preprocessing and transforms
  • Using built-in datasets
  • Data augmentation techniques
  • Custom collate_fn functions for variable-length sequences

Efficient data loading and appropriate preprocessing are fundamental to training deep learning models; mastering these techniques improves both training efficiency and model performance.