YOLOv10目标检测创新改进与实战案例专栏
专栏链接: YOLOv10 创新改进有效涨点
摘要
本论文旨在开发现代、高效、轻量的密集预测模型,并在参数、浮点运算次数与性能之间寻求平衡。虽然倒置残差块(IRB)是轻量级卷积神经网络(CNN)的重要基础,但在基于注意力的研究中尚缺类似的构件。本研究从统一视角出发,结合高效IRB和有效的Transformer组件,重新考虑轻量级基础架构。我们将基于CNN的IRB扩展到基于注意力的模型,并提出了一种单残差元移动块(MMB)用于轻量级模型设计。基于简单而有效的设计原则,我们推出了一种新型的倒置残差移动块(iRMB),并以此为基础构建了一个类似于ResNet的高效模型(EMO),适用于下游任务。在ImageNet-1K、COCO2017和ADE20K基准上的大量实验表明,我们的EMO在性能上超越了最先进的方法,例如,EMO-1M/2M/5M在Top-1准确率上分别达到了71.5、75.1和78.4,超过了同等级别的CNN-/基于注意力的模型,同时在参数、效率和准确度上取得了良好的权衡:在iPhone14上运行速度比EdgeNeXt快2.8-4.0倍。
创新点
iRMB (Inverted Residual Mobile Block) 的创新点在于其结合了CNN和Transformer架构的优点,为移动端应用设计了一个简单而高效的模块。这一设计旨在解决移动设备上对存储和计算资源限制下的高效模型需求,同时克服了现有轻量级CNN模型和Transformer模型的局限性。以下是iRMB的主要创新点:
融合CNN与Transformer的优点:iRMB吸收了CNN在建模短距离依赖方面的高效性,以及Transformer在动态建模长距离交互方面的能力。这种融合提供了一个均衡的解决方案,使得模型既能捕捉局部特征,也能理解全局上下文。
简单高效的设计:通过精心设计的Meta Mobile Block概念,iRMB通过级联的多头自注意力(MHSA)和卷积运算实现了高效的特征提取和信息流动。这种设计不仅保持了模型的高效性,而且简化了模型结构,便于移动端应用的部署和优化。
优化的资源消耗:相对于传统的Transformer模型,iRMB通过特定的设计减少了参数量和计算量(FLOPs),使其更适合在资源受限的移动设备上运行。这一点通过在ImageNet-1K、COCO2017和ADE20K等基准测试中展示了其相对于同类模型的优越性能。
实现特定的技术突破:iRMB的设计克服了传统CNN模型由于其静态归纳偏差而导致的性能瓶颈,同时也解决了Transformer模型在移动设备上部署时由于参数和计算量大导致的问题。通过这种设计,iRMB为移动端高效模型的开发提供了新的思路。
灵活性和泛化能力:iRMB不仅在图像分类任务上表现出色,还在目标检测和语义分割等多个计算机视觉任务中展现了其优异的性能。这证明了iRMB不仅是一个高效的模块,而且具有良好的泛化能力,可以适用于多种视觉处理任务。
yolov10 引入
class iRMB(nn.Module):
def __init__(self, dim_in, dim_out, norm_in=True, has_skip=True, exp_ratio=1.0, norm_layer='bn_2d',
act_layer='relu', v_proj=True, dw_ks=3, stride=1, dilation=1, se_ratio=0.0, dim_head=8, window_size=7,
attn_s=True, qkv_bias=False, attn_drop=0., drop=0., drop_path=0., v_group=False, attn_pre=False):
super().__init__()
self.norm = get_norm(norm_layer)(dim_in) if norm_in else nn.Identity()
dim_mid = int(dim_in * exp_ratio)
self.has_skip = (dim_in == dim_out and stride == 1) and has_skip
self.attn_s = attn_s
if self.attn_s:
assert dim_in % dim_head == 0, 'dim should be divisible by num_heads'
self.dim_head = dim_head
self.window_size = window_size
self.num_head = dim_in // dim_head
self.scale = self.dim_head ** -0.5
self.attn_pre = attn_pre
self.qk = ConvNormAct(dim_in, int(dim_in * 2), kernel_size=1, bias=qkv_bias, norm_layer='none',
act_layer='none')
self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, groups=self.num_head if v_group else 1, bias=qkv_bias,
norm_layer='none', act_layer=act_layer, inplace=inplace)
self.attn_drop = nn.Dropout(attn_drop)
else:
if v_proj:
self.v = ConvNormAct(dim_in, dim_mid, kernel_size=1, bias=qkv_bias, norm_layer='none',
act_layer=act_layer, inplace=inplace)
else:
self.v = nn.Identity()
self.conv_local = ConvNormAct(dim_mid, dim_mid, kernel_size=dw_ks, stride=stride, dilation=dilation,
groups=dim_mid, norm_layer='bn_2d', act_layer='silu', inplace=inplace)
self.se = SqueezeExcite(dim_mid, rd_ratio=se_ratio, act_layer=get_act(act_layer)) if se_ratio > 0.0 else nn.Identity()
self.proj_drop = nn.Dropout(drop)
self.proj = ConvNormAct(dim_mid, dim_out, kernel_size=1, norm_layer='none', act_layer='none', inplace=inplace)
self.drop_path = DropPath(drop_path) if drop_path else nn.Identity()
def forward(self, x):
shortcut = x
x = self.norm(x)
B, C, H, W = x.shape
if self.attn_s:
# padding
if self.window_size <= 0:
window_size_W, window_size_H = W, H
else:
window_size_W, window_size_H = self.window_size, self.window_size
pad_l, pad_t = 0, 0
pad_r = (window_size_W - W % window_size_W) % window_size_W
pad_b = (window_size_H - H % window_size_H) % window_size_H
x = F.pad(x, (pad_l, pad_r, pad_t, pad_b, 0, 0,))
n1, n2 = (H + pad_b) // window_size_H, (W + pad_r) // window_size_W
x = rearrange(x, 'b c (h1 n1) (w1 n2) -> (b n1 n2) c h1 w1', n1=n1, n2=n2).contiguous()
# attention
b, c, h, w = x.shape
qk = self.qk(x)
qk = rearrange(qk, 'b (qk heads dim_head) h w -> qk b heads (h w) dim_head', qk=2, heads=self.num_head,
dim_head=self.dim_head).contiguous()
q, k = qk[0], qk[1]
attn_spa = (q @ k.transpose(-2, -1)) * self.scale
attn_spa = attn_spa.softmax(dim=-1)
attn_spa = self.attn_drop(attn_spa)
if self.attn_pre:
x = rearrange(x, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
x_spa = attn_spa @ x
x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
w=w).contiguous()
x_spa = self.v(x_spa)
else:
v = self.v(x)
v = rearrange(v, 'b (heads dim_head) h w -> b heads (h w) dim_head', heads=self.num_head).contiguous()
x_spa = attn_spa @ v
x_spa = rearrange(x_spa, 'b heads (h w) dim_head -> b (heads dim_head) h w', heads=self.num_head, h=h,
w=w).contiguous()
# unpadding
x = rearrange(x_spa, '(b n1 n2) c h1 w1 -> b c (h1 n1) (w1 n2)', n1=n1, n2=n2).contiguous()
if pad_r > 0 or pad_b > 0:
x = x[:, :, :H, :W].contiguous()
else:
x = self.v(x)
x = x + self.se(self.conv_local(x)) if self.has_skip else self.se(self.conv_local(x))
x = self.proj_drop(x)
x = self.proj(x)
x = (shortcut + self.drop_path(x)) if self.has_skip else x
return x
task与yaml配置
详见:https://blog.csdn.net/shangyanaf/article/details/140072635