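[Model Performance Boosters Explained] If Your Project's Model Hits a Bottleneck, These Tricks Are the Answer! (Part 2) - Alibaba Cloud Developer Community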


2023-05-18



4、Training Refinements

4.1、Cosine Learning Rate Decay

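Loshchilov et al. proposed a cosine annealing strategy. A simplified version decreases the learning rate from its initial value to 0 by following a cosine function. Assuming the total number of batches is $T$ (ignoring the warmup stage), the learning rate $\eta_t$ at batch $t$ is computed as:

$$\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta$$

where $\eta$ is the initial learning rate.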

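It follows that cosine decay lowers the learning rate slowly at the start, becomes almost linear in the middle, and slows down again at the end. Compared with step decay, cosine decay starts decaying the learning rate from the very beginning, yet the rate stays large until step decay has cut its learning rate by 10x, which potentially improves training progress.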

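import torch
optimizer = torch.optim.SGD(torch.nn.Linear(10, 2).parameters(), lr=0.1)  # placeholder model and optimizer
# T_max is the number of scheduler steps over which the learning rate decays to eta_min (100 is a placeholder)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0, last_epoch=-1)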
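As a rough sketch of the simplified schedule described in this section, the learning rate at batch t can also be computed by hand as 0.5 * (1 + cos(pi * t / T)) * lr_0. The helper name `cosine_lr`, the optional `min_lr` argument, and the values used below are illustrative assumptions, not part of the original post; warmup is ignored.

```python
import math


def cosine_lr(t, total_steps, base_lr, min_lr=0.0):
    """Cosine-decayed learning rate at step t (0 <= t <= total_steps), ignoring warmup.

    Hypothetical helper for illustration only.
    """
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Illustrative values (assumptions): 100 steps, initial learning rate 0.1.
schedule = [cosine_lr(t, 100, 0.1) for t in range(101)]
print(schedule[0], schedule[25], schedule[50], schedule[75], schedule[100])
```

Printing the values at steps 0, 25, 50, 75, and 100 shows the shape described above: the drop over the first quarter is much smaller than the drop over the second quarter, and the curve flattens again near the end.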

4.2、Label Smoothing

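A model's predicted distribution can never be as sharp as a hard ground-truth label, so the target is smoothed instead. With $K$ classes, true label $y$, and smoothing factor $\varepsilon$, the label smoothing rule used in the code below is:

$$q_i = \begin{cases} 1 - \varepsilon & \text{if } i = y \\ \varepsilon / (K - 1) & \text{otherwise} \end{cases}$$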

# -*- coding: utf-8 -*-
"""
Label smoothing target distribution:
q_i = 1 - smoothing            (if i == y)
q_i = smoothing / (size - 1)   (otherwise)
So the target tensor can be filled with the "otherwise" value first, and only
the position i == y is overwritten with 1 - smoothing.
KLDivLoss and cross-entropy differ only by a constant (the entropy of the target).
For the scores
predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                             [0, 0.9, 0.2, 0.1, 0],
                             [1, 0.2, 0.7, 0.1, 0]])
with labels [2, 1, 0] and smoothing = 0.1, the smoothed targets are
tensor([[0.0250, 0.0250, 0.9000, 0.0250, 0.0250],
        [0.0250, 0.9000, 0.0250, 0.0250, 0.0250],
        [0.9000, 0.0250, 0.0250, 0.0250, 0.0250]])
as opposed to the one-hot targets
tensor([[0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.],
        [1., 0., 0., 0., 0.]])
"""
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
class LabelSmoothing(nn.Module):
    """Label smoothing loss; `size` is the total number of classes."""
    def __init__(self, size, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        # reduction='sum' replaces the deprecated size_average=False
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.confidence = 1.0 - smoothing  # value used where i == y
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None
    def forward(self, x, target):
        """
        x: (N, M) log-probabilities for N samples over M classes
        target: (N,) integer class labels
        """
        assert x.size(1) == self.size
        # Fill every entry with the "otherwise" value
        true_dist = torch.full_like(x, self.smoothing / (self.size - 1))
        # Scatter along dim 1: write `confidence` at each row's label index
        true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
        self.true_dist = true_dist.detach()
        return self.criterion(x, self.true_dist)
if __name__ == "__main__":
    # Example of label smoothing.
    crit = LabelSmoothing(size=5, smoothing=0.1)
    # predict.shape == (3, 5); the rows are treated as unnormalized scores
    predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                                 [0, 0.9, 0.2, 0.1, 0],
                                 [1, 0.2, 0.7, 0.1, 0]])
    # log_softmax yields valid log-probabilities; predict.log() would give -inf for the zero entries
    v = crit(predict.log_softmax(dim=1), torch.LongTensor([2, 1, 0]))
    # Show the target distributions expected by the system.
    plt.imshow(crit.true_dist)
    plt.show()
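The listing remarks that KLDivLoss and cross-entropy differ only by a constant. This can be checked numerically; the following is a minimal NumPy sketch, assuming the same scores, labels [2, 1, 0], and smoothing value as the example. `smooth_targets` and `log_softmax` are hypothetical helper names introduced only for illustration.

```python
import numpy as np


def smooth_targets(labels, num_classes, smoothing):
    """Smoothed targets: 1 - smoothing at the label, smoothing / (K - 1) elsewhere."""
    q = np.full((len(labels), num_classes), smoothing / (num_classes - 1))
    q[np.arange(len(labels)), labels] = 1.0 - smoothing
    return q


def log_softmax(scores):
    """Numerically stable row-wise log-softmax."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


scores = np.array([[0, 0.2, 0.7, 0.1, 0],
                   [0, 0.9, 0.2, 0.1, 0],
                   [1, 0.2, 0.7, 0.1, 0]], dtype=float)
labels = np.array([2, 1, 0])

q = smooth_targets(labels, num_classes=5, smoothing=0.1)
log_p = log_softmax(scores)

cross_entropy = -(q * log_p).sum()               # depends on the prediction
target_entropy = -(q * np.log(q)).sum()          # constant with respect to the prediction
kl_divergence = (q * (np.log(q) - log_p)).sum()

# KL(q || p) = CE(q, p) - H(q)
print(kl_divergence, cross_entropy - target_entropy)
```

Because the entropy term does not depend on the model output, minimizing either loss leads to the same gradients with respect to the prediction.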

4.3、Knowledge Distillation

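A distillation loss is added during training to penalize the difference between the softmax outputs of the teacher model and the student model. Given an input, let $p$ be the true probability distribution, and let $z$ and $r$ be the outputs of the last fully connected layer of the student model and the teacher model, respectively. With $\ell$ denoting the cross-entropy loss, the loss becomes:

$$\ell(p, \mathrm{softmax}(z)) + T^2\,\ell(\mathrm{softmax}(r/T), \mathrm{softmax}(z/T))$$

where $T$ is a temperature hyperparameter that makes the softmax outputs smoother, so that the student learns the label distribution predicted by the teacher.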
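A minimal NumPy sketch of such a distillation loss follows, assuming it takes the form of a hard-label cross-entropy plus a T²-weighted cross-entropy between temperature-softened teacher and student outputs. The function names, the temperature value, and the logits are illustrative assumptions rather than the original author's code.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(target_probs, pred_probs):
    """Mean cross-entropy between two row-wise probability distributions."""
    return float(-(target_probs * np.log(pred_probs)).sum(axis=1).mean())


def distillation_loss(student_logits, teacher_logits, one_hot_labels, temperature=2.0):
    """Hard-label loss plus T^2-weighted soft-label loss (assumed form)."""
    hard = cross_entropy(one_hot_labels, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return hard + temperature ** 2 * soft


z = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])   # student logits (made up)
r = np.array([[3.0, 0.2, -2.0], [0.0, 2.5, 0.5]])   # teacher logits (made up)
p = np.eye(3)[[0, 1]]                               # one-hot true labels
print(distillation_loss(z, r, p, temperature=2.0))
```

The T² factor is commonly included to keep the gradient magnitude of the softened term comparable to that of the hard-label term as the temperature changes.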

4.4、Mixup Training

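In mixup, each time we randomly sample two examples $(x_i, y_i)$ and $(x_j, y_j)$. A new example is then formed by a weighted linear interpolation of the two:

$$\hat{x} = \lambda x_i + (1 - \lambda) x_j$$
$$\hat{y} = \lambda y_i + (1 - \lambda) y_j$$

where $\lambda \in [0, 1]$ is a random number drawn from the $\mathrm{Beta}(\alpha, \alpha)$ distribution.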

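import numpy as np
import torch
def mixup_data(x, y, alpha=1.0, use_cuda=True):
    """Return the mixed inputs, both label sets, and the mixing weight lam."""
    # lam ~ Beta(alpha, alpha); alpha <= 0 disables mixing
    if alpha > 0.:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1.
    batch_size = x.size()[0]
    # Build the permutation on x's device so the code also runs without a GPU;
    # use_cuda is kept only for backward compatibility
    index = torch.randperm(batch_size, device=x.device)
    mixed_x = lam * x + (1 - lam) * x[index, :]  # blend the batch with a shuffled copy of itself
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam
def mixup_criterion(y_a, y_b, lam):
    # The loss is the same lam-weighted combination, applied to the two label sets
    return lambda criterion, pred: lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)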
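Why `mixup_criterion` combines two losses rather than building a mixed label can be sketched in NumPy: for cross-entropy, the loss against the interpolated one-hot label lam * y_a + (1 - lam) * y_b equals the same weighted sum of the two per-label losses. The labels, predictions, and helper name below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def cross_entropy(target_probs, pred_probs):
    """Mean cross-entropy between two row-wise probability distributions."""
    return float(-(target_probs * np.log(pred_probs)).sum(axis=1).mean())


num_classes = 4
lam = rng.beta(1.0, 1.0)                    # lam ~ Beta(alpha, alpha) with alpha = 1
y_a = np.array([0, 2, 1])                   # labels of the original batch
y_b = y_a[rng.permutation(len(y_a))]        # labels of the shuffled batch

# Made-up predicted probabilities for the mixed inputs.
logits = rng.normal(size=(3, num_classes))
pred = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

one_hot = np.eye(num_classes)
mixed_label = lam * one_hot[y_a] + (1 - lam) * one_hot[y_b]

loss_mixed_label = cross_entropy(mixed_label, pred)
loss_two_terms = (lam * cross_entropy(one_hot[y_a], pred)
                  + (1 - lam) * cross_entropy(one_hot[y_b], pred))
print(loss_mixed_label, loss_two_terms)
```

The two printed values agree because cross-entropy is linear in the target distribution, which is what allows the two-term form to work directly with integer class labels.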
