Caffe loss-tuning tips

what should I do if...
...my loss diverges? (increases by orders of magnitude, goes to inf. or NaN)

  • lower the learning rate (see the solver sketch after this list)
  • raise momentum (with a corresponding learning-rate drop)
  • raise weight decay
  • raise the batch size
  • use gradient clipping (limit the L2 norm of the gradient to a particular value at each iteration; shrink it to that norm if greater)
  • try another solver: momentum SGD, ADAM, RMSProp, ...
  • try a smaller initialization (e.g., for a Gaussian init., lower the stdev.)
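Most of these knobs are solver settings. Below is a minimal sketch, not from the original slides, of setting them through pycaffe's protobuf bindings; the file names are placeholders, and the available fields vary by Caffe version (newer versions use the type string shown here, older ones a solver_type enum).

```python
# Hedged sketch: solver settings that commonly tame a diverging loss.
# Assumes pycaffe is importable; "train_val.prototxt" / "solver.prototxt" are hypothetical names.
from caffe.proto import caffe_pb2
from google.protobuf import text_format

solver = caffe_pb2.SolverParameter()
solver.net = "train_val.prototxt"   # hypothetical net definition
                                    # (the batch size itself lives in that net's data layer, not here)
solver.base_lr = 0.001              # lower the learning rate (e.g. 10x smaller than before)
solver.momentum = 0.95              # raise momentum, paired with the learning-rate drop
solver.weight_decay = 0.0005        # raise weight decay
solver.clip_gradients = 10.0        # clip the gradient's L2 norm to 10 at each iteration
solver.type = "SGD"                 # try "Adam", "RMSProp", "Nesterov", ... for another solver
solver.lr_policy = "step"
solver.gamma = 0.1
solver.stepsize = 20000
solver.max_iter = 100000

with open("solver.prototxt", "w") as f:
    f.write(text_format.MessageToString(solver))
```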

what should I do if...
...my loss doesn’t improve / gets stuck / drops slowly?

  • raise the learning rate
  • (maybe) lower momentum, weight decay, and/or batch size
  • try another solver: momentum SGD, ADAM, RMSProp, ...
  • transfer a pre-trained (e.g. on ImageNet) initialization, if possible
  • use a larger initialization (in particular, make sure you didn’t zero-initialize any multiplicative weights in intermediate layers)
  • use a “smarter” initialization (e.g., for linear layers followed by ReLUs, try the msra initialization in Caffe; see the sketch after this list)
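To illustrate the last two items, here is a hedged pycaffe sketch (all file and layer names are placeholders, not from the slides): a convolution defined through NetSpec with the msra filler, and a warm start from a pretrained .caffemodel, which copies weights into layers whose names match. A real training net would also need data and loss layers.

```python
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=[dict(dim=[64, 3, 224, 224])])
# "msra" scales a Gaussian by the fan-in, a good default for layers followed by ReLUs
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3, pad=1,
                        weight_filler=dict(type="msra"),
                        bias_filler=dict(type="constant", value=0))
n.relu1 = L.ReLU(n.conv1, in_place=True)

with open("train_val.prototxt", "w") as f:      # hypothetical file name
    f.write(str(n.to_proto()))

# transfer a pre-trained initialization: layers with matching names are copied,
# everything else keeps its random (here: msra) initialization
solver = caffe.get_solver("solver.prototxt")                 # hypothetical solver file
solver.net.copy_from("bvlc_reference_caffenet.caffemodel")   # any pretrained weights you have
solver.solve()
```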

  • remove some layers to make the network shallower
    at least to start!
    a strategy for model design: begin with a simple, trainable network; “deepen” it by adding new layers one-by-one

  • modify the architecture to improve gradient flow (see the sketch after this list):
    batch normalization
    residual learning [ResNet]
    intermediate losses [GoogLeNet]
    other tricks
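A hedged NetSpec sketch of the first two tricks: batch normalization (in Caffe usually a BatchNorm layer followed by a Scale layer with a learned bias) and a ResNet-style identity shortcut built with an Eltwise SUM. Shapes and layer names are illustrative only.

```python
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data = L.Input(shape=[dict(dim=[1, 64, 56, 56])])

# batch normalization: BatchNorm (normalize) + Scale (learned gamma/beta) + ReLU
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3, pad=1,
                        weight_filler=dict(type="msra"))
n.bn1 = L.BatchNorm(n.conv1, use_global_stats=False)   # use_global_stats=True at test time
n.scale1 = L.Scale(n.bn1, bias_term=True)
n.relu1 = L.ReLU(n.scale1, in_place=True)

# residual learning: add the block's input back onto its output (identity shortcut)
n.conv2 = L.Convolution(n.relu1, num_output=64, kernel_size=3, pad=1,
                        weight_filler=dict(type="msra"))
n.sum = L.Eltwise(n.data, n.conv2, operation=P.Eltwise.SUM)
n.relu2 = L.ReLU(n.sum, in_place=True)

print(n.to_proto())
```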

be patient! (go outside?)

  • deep learning can take a long time
  • training AlexNet in 2012: 12 days (although this is down to 1 day in 2015!)
    loss hovers around the chance value of ln(1000) ≅ 6.908 for the first 1000+ iterations (~1 hour on a 2012 GPU)
  • training ResNet-152 in 2015: 1-2 months (on 8 GPUs!)
  • the best configurations (net architectures, solvers) at convergence are often not the ones that train fastest early on
    some tricks to speed up learning can be “greedy” rather than ultimately beneficial

One more tip: if GPU memory is not enough, consider setting iter_size in the solver to increase the effective batch size.
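A small sketch of the idea (the numbers are illustrative): with batch_size: 32 in the data layer and iter_size: 4 in the solver, gradients are accumulated over 4 forward/backward passes before each weight update, giving an effective batch size of 128 without extra memory per pass.

```python
from caffe.proto import caffe_pb2

solver = caffe_pb2.SolverParameter()
solver.net = "train_val.prototxt"   # hypothetical net whose data layer uses batch_size: 32
solver.iter_size = 4                # accumulate gradients over 4 passes -> effective batch size 128
solver.base_lr = 0.01
print(solver)                       # serialize with google.protobuf.text_format to write solver.prototxt
```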

reference

https://docs.google.com/presentation/d/1HxGdeq8MPktHaPb-rlmYYQ723iWzq9ur6Gjo71YiG0Y/edit#slide=id.g8629ab2c8_0_60
