深度学习论文阅读 目标检测篇(二):《Fast R-CNN》

简介:本文提出了一种快速的基于区域的卷积网络方法(Fast R-CNN)用于目标检测。Fast R-CNN 建立在之前工作的基础上,利用深度卷积网络高效地对候选目标进行分类。相比于之前的研究工作,Fast R-CNN 采用了多项创新,提高了训练和测试速度,同时也提高了检测准确度。



Abstract 摘要


This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.


本文提出了一种快速的基于区域的卷积网络方法(Fast R-CNN)用于目标检测。Fast R-CNN 建立在之前工作的基础上,利用深度卷积网络高效地对候选目标进行分类。相比于之前的研究工作,Fast R-CNN 采用了多项创新,提高了训练和测试速度,同时也提高了检测准确度。Fast R-CNN 训练非常深的 VGG16 网络比 R-CNN 快 9 倍,测试时快 213 倍,并在 PASCAL VOC 2012 上得到了更高的 mAP。与 SPPnet 相比,Fast R-CNN 训练 VGG16 网络快 3 倍,测试速度快 10 倍,并且更准确。Fast R-CNN 的 Python 和 C++(使用 Caffe)实现以 MIT 开源许可证发布(链接见下文)。


1. Introduction 引言


 Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.


最近,深度卷积网络[14, 16]已经显著提高了图像分类[14]和目标检测[9, 19]的准确性。与图像分类相比,目标检测是一个更具挑战性的任务,需要更复杂的方法来解决。由于这种复杂性,当前的方法(例如[9, 11, 19, 25])采用多级 pipeline 的方式训练模型,既慢又不够简洁。

 Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.


 复杂性的产生是因为检测需要目标的精确定位,这就导致两个主 要的难点。首先,必须处理大量候选目标位置(通常称为“proposals”)。 第二,这些候选框仅提供粗略定位,其必须被精细化以实现精确定位。 这些问题的解决方案经常会影响速度、准确性或简洁性。


 In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.


在本文中,我们简化了最先进的基于卷积网络的目标检测器的训 练过程[9, 11]。我们提出一个单阶段训练算法,联合学习候选框分类 和修正他们的空间位置。


 The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).

该方法能够训练非常深的检测网络(VGG16[20]),训练速度比 R-CNN[9] 快 9 倍,比 SPPnet[11] 快 3 倍。在运行时,检测网络处理每张图像只需 0.3 秒(不包括候选框的生成时间),同时在 PASCAL VOC 2012[7] 数据集上达到最高准确度,mAP 为 66%(R-CNN 为 62%)(注:所有时间都是使用一个超频到 875MHz 的 Nvidia K40 GPU 测试的)。

1.1 R-CNN and SPPnet R-CNN 与 SPPnet


 The Region-based Convolutional Network method (RCNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks: 1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned. 2. Training is expensive in space and time. For SVM and boundingbox regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage. 3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

基于区域的卷积网络方法(RCNN)[9]通过使用深度卷积网络来 分类目标候选框,获得了很高的目标检测精度。然而,R-CNN 具有明显的缺点 :

1. 训练过程是多级 pipeline。R-CNN 首先使用目标候选框、以 log 损失对卷积神经网络进行 fine-tune。然后,它将卷积神经网络得到的特征送入 SVM 进行拟合。这些 SVM 作为目标检测器,替代通过 fine-tune 学习到的 softmax 分类器。在第三个训练阶段,学习 bounding-box 回归器。

2. 训练在时间和空间上的开销都很大。对于 SVM 和 bounding-box 回归器的训练,需要从每个图像中的每个目标候选框提取特征,并写入磁盘。对于 VOC07 trainval 的 5k 张图像,使用像 VGG16 这样非常深的网络时,这个过程需要 2.5 个 GPU·天。这些特征需要数百 GB 的存储空间。

3. 目标检测速度很慢。在测试时,需要从每个测试图像中的每个目标候选框提取特征。用 VGG16 网络检测目标时,每张图像需要 47 秒(在 GPU 上)。


 R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6×6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.


R-CNN 很慢是因为它为每个目标候选框都执行一次卷积神经网络前向传播,而没有共享计算。SPPnet[11] 提出通过共享计算来加速 R-CNN。SPPnet 对整个输入图像计算一次卷积特征图,然后使用从共享特征图中提取的特征向量对每个候选框进行分类。针对某个候选框的特征,是通过把特征图中位于该候选框内的部分最大池化成固定大小的输出(例如 6×6)来提取的。与空间金字塔池化[15]一样,对多种输出尺寸分别池化,再将结果拼接起来。SPPnet 在测试时将 R-CNN 加速 10 到 100 倍。由于更快的候选框特征提取,训练时间也减少了 3 倍。

 SPPnet also has notable drawbacks. Like R-CNN, training is a multistage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.


 SPP 网络也有显著的缺点。像 R-CNN 一样,训练过程是一个多 级 pipeline,涉及提取特征、使用 log 损失对网络进行 fine-tuning、训 练 SVM 分类器以及最后拟合检测框回归。特征也要写入磁盘。但与 R-CNN 不同,在[11]中提出的 fine-tuning 算法不能更新在空间金字塔 池之前的卷积层。不出所料,这种局限性(固定的卷积层)限制了深 层网络的精度。

1.2 Contributions 贡献


 We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast RCNN method has several advantages: 1. Higher detection quality (mAP) than R-CNN, SPPnet 2. Training is single-stage, using a multi-task loss 3. Training can update all network layers 4. No disk storage is required for feature caching

我们提出一种新的训练算法,修正了 R-CNN 和 SPPnet 的缺点, 同时提高了速度和准确性。因为它能比较快地进行训练和测试,我们 称之为 Fast R-CNN。Fast RCNN 方法有以下几个优点: 1. 比 R-CNN 和 SPPnet 具有更高的目标检测精度(mAP)。 2. 训练是使用多任务损失的单阶段训练。 3. 训练可以更新所有网络层参数。 4. 不需要磁盘空间缓存特征。


Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.


 Fast R-CNN 使用 Python 和 C++(Caffe[13])编写,以 MIT 开源许 可证发布在:https://github.com/rbgirshick/fast-rcnn

2. Fast R-CNN architecture and training Fast R-CNN 架构与训练


 Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.


Fast R-CNN 的架构如图 1 所示。Fast R-CNN 网络将整个图像和一组候选框作为输入。网络首先使用几个卷积层(conv)和最大池化层来处理整个图像,以产生卷积特征图。然后,对于每个候选框,RoI 池化层从特征图中提取固定长度的特征向量。每个特征向量被送入一系列全连接(fc)层,其最终分支成两个同级输出层:一个对 K 个目标类别加上 1 个兜底的"背景"类输出 Softmax 概率估计,另一个为 K 个目标类别中的每个类别输出四个实数值。每组 4 个值编码了其中一个类别修正后的检测框位置。


[图 1:Fast R-CNN 架构示意图]
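为了帮助理解上面提到的两个同级输出层,下面给出一个最小的 NumPy 示意(并非论文的原始实现;权重名称、形状与输入均为本文假设,仅作说明):

```python
import numpy as np

def detection_heads(fc7_feature, W_cls, b_cls, W_bbox, b_bbox):
    """示意 Fast R-CNN 的两个同级输出层(变量名为假设)。

    fc7_feature: 单个 RoI 的定长特征向量,形状 (D,)
    W_cls:  形状 (K+1, D),对应 K 个目标类别 + 1 个背景类
    W_bbox: 形状 (4K, D),每个目标类别输出 4 个检测框回归值
    """
    # 分类分支:K+1 个类别上的 Softmax 概率
    scores = W_cls @ fc7_feature + b_cls
    scores = scores - scores.max()              # 数值稳定
    probs = np.exp(scores) / np.exp(scores).sum()

    # 回归分支:每个类别一组 (tx, ty, tw, th)
    bbox_deltas = (W_bbox @ fc7_feature + b_bbox).reshape(-1, 4)
    return probs, bbox_deltas
```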


2.1 The RoI pooling layer RoI 池化层


The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).


 RoI 池化层使用最大池化将任何有效的 RoI 内的特征转换成具有 H×W(例如,7×7)的固定空间范围的小特征图,其中 H 和 W 是层 的超参数,独立于任何特定的 RoI。在本文中,RoI 是卷积特征图中 的一个矩形窗口。每个 RoI 由指定其左上角(r,c)及其高度和宽度(h,w) 的四元组(r,c,h,w)定义。

RoI max pooling works by dividing the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].


 RoI 最大池化通过将大小为 h×w 的 RoI 窗口分割成 H×W 个网 格,子窗口大小约为 h/H×w/W,然后对每个子窗口执行最大池化,并 将输出合并到相应的输出网格单元中。同标准的最大池化一样,池化 操作独立应用于每个特征图通道。RoI 层只是 SPPnets[11]中使用的空 间金字塔池层的特例,其只有一个金字塔层。我们使用[11]中给出的 池化子窗口计算方法。
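下面是 RoI 最大池化前向计算的一个简化 NumPy 草图(子窗口边界的取整方式只是一种常见近似,并非论文给出的精确实现):

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    """feature_map: 形状 (C, h_map, w_map) 的卷积特征图;
    roi = (r, c, h, w),以特征图坐标给出左上角和高、宽。
    返回形状 (C, H, W) 的定长输出。"""
    r, c, h, w = roi
    C = feature_map.shape[0]
    out = np.zeros((C, H, W))
    for i in range(H):                  # 将 h×w 窗口划分为 H×W 个子窗口
        for j in range(W):
            r0 = r + int(np.floor(i * h / H))
            r1 = r + int(np.ceil((i + 1) * h / H))
            c0 = c + int(np.floor(j * w / W))
            c1 = c + int(np.ceil((j + 1) * w / W))
            if r1 > r0 and c1 > c0:
                # 每个通道独立做最大池化
                out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```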


2.2 Initializing from pre-trained networks 从预训练网络初始化


 We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.


 我们实验了三个预训练的 ImageNet [4]网络,每个网络有五个最 大池化层和 5 至 13 个卷积层(网络详细信息见 4.1 节)。当预训练网 络初始化 fast R-CNN 网络时,其经历三个变换。


 First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).


 首先,最后的最大池化层由 RoI 池层代替,其将 H 和 W 设置为 与网络的第一个全连接层兼容的配置(例如,对于 VGG16,H=W=7)。


 Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and category-specific bounding-box regressors).


其次,网络的最后一个全连接层和 Softmax(其被训练用于 1000 类 ImageNet 分类)被替换为前面描述的两个同级层(一个针对 K+1 个类别的全连接层加 Softmax,以及特定类别的 bounding-box 回归器)。


 Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

 

最后,网络被修改为采用两个数据输入:图像的列表和这些图像 中的 RoI 的列表。

2.3 Fine-tuning for detection 针对检测任务的微调


 Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.


 用反向传播训练所有网络权重是 Fast R-CNN 的重要能力。首先, 让我们阐明为什么 SPPnet 无法更新低于空间金字塔池化层的权重。


 The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).


根本原因是当每个训练样本(即 RoI)来自不同的图像时,通过 SPP 层的反向传播是非常低效的,这正是训练 R-CNN 和 SPPnet 网络 的方法。低效是因为每个 RoI 可能具有非常大的感受野,通常跨越整 个输入图像。由于正向传播必须处理整个感受野,训练输入很大(通 常是整个图像)。

 We propose a more efficient training method that takes advantage of feature sharing during training. In Fast RCNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).


我们提出了一种更有效的训练方法,利用训练期间的特征共享。 在 Fast RCNN 网络训练中,随机梯度下降(SGD)的小批量是被分层 采样的,首先采样 N 个图像,然后从每个图像采样 R/N 个 RoI。关键 的是,来自同一图像的 RoI 在前向和后向传播中共享计算和内存。减 小 N,就减少了小批量的计算。例如,当 N=2 和 R=128 时,得到的 训练方案比从 128 幅不同的图采样一个 RoI(即 R-CNN 和 SPPnet 的 策略)快 64 倍。


 One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.


 这个策略的一个令人担心的问题是它可能导致训练收敛变慢,因 为来自相同图像的 RoI 是相关的。这个问题似乎在实际情况下并不存 在,当 N=2 和 R=128 时,我们使用比 R-CNN 更少的 SGD 迭代就获 得了良好的结果。


 In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.


除了分层采样,Fast R-CNN 使用了简化的训练过程:只需一个 fine-tuning 阶段即可联合优化 Softmax 分类器和 bounding-box 回归器,而不是在三个独立的阶段分别训练 Softmax 分类器、SVM 和回归器[9, 11]。下面将详细描述该过程的各个组成部分(损失、小批量采样策略、通过 RoI 池化层的反向传播和 SGD 超参数)。


Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), $p = (p_0, \ldots, p_K)$, over $K+1$ categories. As usual, $p$ is computed by a softmax over the $K+1$ outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes, indexed by $k$. We use the parameterization for $t^k$ given in [9], in which $t^k$ specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.


多任务损失。Fast R-CNN 网络具有两个同级输出层。第一个输出在 $K+1$ 个类别上的离散概率分布(每个 RoI),$p = (p_0, \ldots, p_K)$。通常,$p$ 由全连接层的 $K+1$ 个输出经 Softmax 计算得到。第二个同级层输出 bounding-box 回归偏移 $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$,其中 $k$ 是 $K$ 个类别的索引。我们使用[9]中给出的方法对 $t^k$ 进行参数化,其中 $t^k$ 指定相对于候选框的尺度不变平移和对数空间的高度/宽度偏移。
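作为参考,[9] 中对 $t^k$ 的参数化大致如下,这里用一个简化的 Python 函数示意(假设框用中心坐标和宽高 $(x, y, w, h)$ 表示):

```python
import numpy as np

def bbox_regression_targets(proposal, gt):
    """计算候选框 proposal 相对检测框真值 gt 的回归目标 (tx, ty, tw, th)。
    两者均为 (x_center, y_center, w, h)。"""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw          # 相对候选框宽度的平移(尺度不变)
    ty = (gy - py) / ph
    tw = np.log(gw / pw)         # 对数空间的宽度/高度偏移
    th = np.log(gh / ph)
    return np.array([tx, ty, tw, th])
```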


Each training RoI is labeled with a ground-truth class $u$ and a ground-truth bounding-box regression target $v$. We use a multi-task loss $L$ on each labeled RoI to jointly train for classification and bounding-box regression:


$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\,L_{loc}(t^u, v) \tag{1}$$


in which $L_{cls}(p, u) = -\log p_u$ is the log loss for the true class $u$.


每个训练的 RoI 用类真值 u 和 bounding-box 回归目标真值 v 打 上标签。我们对每个标记的 RoI 使用多任务损失 L 以联合训练分类和 bounding-box 回归:


$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\,L_{loc}(t^u, v) \tag{1}$$


其中 $L_{cls}(p, u) = -\log p_u$ 是类真值 $u$ 的 log 损失。

The second task loss, $L_{loc}$, is defined over a tuple of true bounding-box regression targets for class $u$, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$. The Iverson bracket indicator function $[u \ge 1]$ evaluates to 1 when $u \ge 1$ and 0 otherwise. By convention the catch-all background class is labeled $u = 0$. For background RoIs there is no notion of a ground-truth bounding box and hence $L_{loc}$ is ignored. For bounding-box regression, we use the loss


$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i) \tag{2}$$


in which


$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{3}$$


is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with $L_2$ loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.


对于类真值 $u$,第二个任务损失 $L_{loc}$ 定义在类 $u$ 的 bounding-box 回归目标真值元组 $v = (v_x, v_y, v_w, v_h)$ 和预测元组 $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$ 上。Iverson 括号指示函数 $[u \ge 1]$ 在 $u \ge 1$ 时取值为 1,否则为 0。按照惯例,兜底的背景类标记为 $u = 0$。对于背景 RoI,没有检测框真值的概念,因此 $L_{loc}$ 被忽略。对于检测框回归,我们使用损失:


$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i) \tag{2}$$


其中:


$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{3}$$


是鲁棒的 $L_1$ 损失,对于异常值比 R-CNN 和 SPPnet 中使用的 $L_2$ 损失更不敏感。当回归目标无界时,使用 $L_2$ 损失的训练可能需要仔细调整学习率,以防止梯度爆炸。公式(3)消除了这种敏感性。


The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. All experiments use λ = 1.


公式(1)中的超参数 λ 控制两个任务损失之间的平衡。我们将回归目标真值 $v_i$ 归一化为零均值、单位方差。所有实验都使用 λ=1。
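下面用一个简短的 NumPy 草图把公式(1)~(3)串起来(仅为说明性示例,按论文默认取 λ=1):

```python
import numpy as np

def smooth_l1(x):
    """公式(3)的鲁棒 L1 损失,逐元素计算。"""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """公式(1):L = L_cls(p, u) + λ[u >= 1] L_loc(t^u, v)。
    p: (K+1,) 的 Softmax 概率;u: 类真值;
    t_u: 类 u 的预测偏移 (4,);v: 回归目标真值 (4,)。"""
    L_cls = -np.log(p[u])                 # log 损失
    if u >= 1:                            # 背景 RoI(u = 0)忽略 L_loc
        L_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()
        return L_cls + lam * L_loc
    return L_cls
```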


 We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a twonetwork system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).


我们注意到,[6]使用相关的损失来训练一个类别无关的目标候选网络。与我们的方法不同,[6]倡导一个将定位和分类分离的双网络系统。OverFeat[19]、R-CNN[9] 和 SPPnet[11] 也训练分类器和检测框定位器,但是这些方法使用逐级训练,我们将证明这种方式对于 Fast R-CNN 来说是次优的(见 5.1 节)。


Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.


小批量采样。在微调期间,每个 SGD 小批量由 N=2 张图像构成,均匀随机选择(按照通常的做法,我们实际上是在数据集的随机排列上迭代)。我们使用大小为 R=128 的小批量,从每张图像采样 64 个 RoI。如[9]中一样,我们从与检测框真值的交并比(IoU)至少为 0.5 的候选框中获取 25% 的 RoI。这些 RoI 构成了标记为前景类别的样本,即 u≥1。按照[11],剩余的 RoI 从与检测框真值的最大 IoU 位于区间 [0.1, 0.5) 的候选框中采样。这些是背景样本,用 u=0 标记。0.1 的 IoU 下限似乎起到了困难样本挖掘[8]的启发式作用。在训练期间,图像以 0.5 的概率水平翻转。不使用其他数据增强。
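下面的草图示意了上述分层采样规则(N=2、R=128、25% 前景;输入形式与函数名为本文假设):

```python
import numpy as np

def sample_minibatch_rois(roi_max_ious, N=2, R=128, fg_frac=0.25, rng=None):
    """roi_max_ious: 长度为 N 的列表,每个元素是该图像全部候选框
    与检测框真值的最大 IoU 数组。返回每张图像选中的(前景下标, 背景下标)。"""
    rng = rng or np.random.default_rng()
    rois_per_image = R // N                       # 128 / 2 = 64
    fg_per_image = int(fg_frac * rois_per_image)  # 其中 25% 为前景
    batch = []
    for ious in roi_max_ious:
        fg_idx = np.where(ious >= 0.5)[0]                   # 前景:IoU >= 0.5,u >= 1
        bg_idx = np.where((ious >= 0.1) & (ious < 0.5))[0]  # 背景:IoU 在 [0.1, 0.5),u = 0
        n_fg = min(fg_per_image, len(fg_idx))
        n_bg = min(rois_per_image - n_fg, len(bg_idx))
        fg = rng.choice(fg_idx, n_fg, replace=False) if n_fg else np.empty(0, dtype=int)
        bg = rng.choice(bg_idx, n_bg, replace=False) if n_bg else np.empty(0, dtype=int)
        batch.append((fg, bg))
    return batch
```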


 Back-propagation through RoI pooling layers. Backpropagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.


通过 RoI 池化层的反向传播。反向传播把梯度传递通过 RoI 池化层。为了清楚起见,我们假设每个小批量只有一张图像(N=1),但扩展到 N>1 是很直接的,因为前向传播是独立处理每张图像的。

Let $x_i \in \mathbb{R}$ be the $i$-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer's $j$-th output from the $r$-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r,j) = \operatorname{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r,j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.


令 $x_i \in \mathbb{R}$ 是 RoI 池化层的第 $i$ 个激活输入,令 $y_{rj}$ 是该层对第 $r$ 个 RoI 的第 $j$ 个输出。RoI 池化层计算 $y_{rj} = x_{i^*(r,j)}$,其中 $i^*(r,j) = \operatorname{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$,$\mathcal{R}(r,j)$ 是输出单元 $y_{rj}$ 所最大池化的子窗口内输入的索引集合。一个 $x_i$ 可以被分配给几个不同的输出 $y_{rj}$。

The RoI pooling layer's backwards function computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax switches:


$$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} \left[\, i = i^*(r, j) \,\right] \frac{\partial L}{\partial y_{rj}} \tag{4}$$


RoI 池化层的反向传播函数通过遵循 argmax switches,计算损失函数关于每个输入变量 $x_i$ 的偏导数:


$$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} \left[\, i = i^*(r, j) \,\right] \frac{\partial L}{\partial y_{rj}} \tag{4}$$


In words, for each mini-batch RoI $r$ and for each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling. In back-propagation, the partial derivatives $\partial L / \partial y_{rj}$ are already computed by the backwards function of the layer on top of the RoI pooling layer.


换句话说,对于每个小批量 RoI $r$ 和每个池化输出单元 $y_{rj}$,如果 $i$ 是 $y_{rj}$ 通过最大池化选择的 argmax,则累加偏导数 $\partial L / \partial y_{rj}$。在反向传播中,偏导数 $\partial L / \partial y_{rj}$ 已经由 RoI 池化层上一层的反向传播函数计算好。
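下面的 NumPy 草图示意了公式(4)的累加过程(假设前向传播时已记录每个输出单元所选的 argmax 下标):

```python
import numpy as np

def roi_pool_backward(dL_dy, argmax_idx, num_inputs):
    """dL_dy[r, j] 是上层传回的 ∂L/∂y_rj;
    argmax_idx[r, j] 是前向时子窗口 R(r, j) 中被选中的输入下标 i*(r, j)。"""
    dL_dx = np.zeros(num_inputs)
    R, J = dL_dy.shape
    for r in range(R):
        for j in range(J):
            # 若同一个输入 i 被多个输出选中,则偏导数在此处累加
            dL_dx[argmax_idx[r, j]] += dL_dy[r, j]
    return dL_dx
```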


SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.


SGD 超参数。用于 Softmax 分类和检测框回归的全连接层权重,分别使用标准差为 0.01 和 0.001 的零均值高斯分布初始化。偏置初始化为 0。所有层的权重学习率为全局学习率的 1 倍,偏置为全局学习率的 2 倍,全局学习率为 0.001。在 VOC07 或 VOC12 trainval 上训练时,我们先运行 30k 次小批量 SGD 迭代,然后将学习率降低到 0.0001,再训练 10k 次迭代。当我们在更大的数据集上训练时,如后文所述,会运行更多次 SGD 迭代。动量为 0.9,参数衰减为 0.0005(作用于权重和偏置)。
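如果用 PyTorch 来复现这些超参数,可以参考下面的草图为权重和偏置设置不同的学习率(这里假设 model 是任意 torch.nn.Module;这只是对论文设置的一种近似,并非原始的 Caffe 求解器配置):

```python
import torch

def make_sgd_optimizer(model, base_lr=0.001):
    """权重使用 1 倍全局学习率,偏置使用 2 倍;动量 0.9,参数衰减 0.0005。
    训练 30k 次迭代后再把全局学习率降到 0.0001,继续训练 10k 次迭代。"""
    weights, biases = [], []
    for name, param in model.named_parameters():
        (biases if name.endswith("bias") else weights).append(param)
    return torch.optim.SGD(
        [{"params": weights, "lr": base_lr},
         {"params": biases, "lr": 2 * base_lr}],
        lr=base_lr, momentum=0.9, weight_decay=0.0005)
```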


2.4 Scale invariance 尺度不变性


 We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.


 我们探索两种实现尺度不变目标检测的方法:(1)通过“brute force”学习和(2)通过使用图像金字塔。这些策略遵循[11]中的两种 方法。在“brute force”方法中,在训练和测试期间以预定义的像素大小 处理每个图像。网络必须直接从训练数据学习尺度不变性目标检测。


The multi-scale approach, in contrast, provides approximate scaleinvariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.


相反,多尺度方法通过图像金字塔向网络提供近似的尺度不变性。在测试时,图像金字塔用于对每个候选框进行近似的尺度归一化。按照[11]中的方法,在多尺度训练期间,每次采样一张图像时随机采样一个金字塔尺度,作为数据增强的一种形式。由于 GPU 内存限制,我们只对较小的网络进行多尺度训练实验。


3. Fast R-CNN detection Fast R-CNN检测


Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are precomputed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to $224^2$ pixels in area [11].


一旦 Fast R-CNN 网络被微调完毕,检测就相当于运行一次前向传播(假设候选框是预先计算好的)。网络将输入图像(或编码为图像列表的图像金字塔)和 R 个待打分的候选框列表作为输入。在测试时,R 通常在 2000 左右,尽管我们也会考虑更大(约 45k)的情况。当使用图像金字塔时,每个 RoI 被分配到某个尺度,使得缩放后的 RoI 面积最接近 $224^2$ 个像素[11]。

  For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k|r) ≜ pk. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].


对于每个测试的 RoI r,前向传播输出类别后验概率分布 p 和相对于 r 的预测检测框偏移集合(K 个类别中的每个类别都有自己修正后的检测框预测结果)。我们使用估计的概率 Pr(class=k|r) ≜ $p_k$ 为每个目标类别 k 赋予 r 的检测置信度。然后,我们使用 R-CNN[9] 中的算法和设置,对每个类别独立执行非极大值抑制。
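每个类别上独立执行的非极大值抑制,可以用如下经典的贪心实现来示意(IoU 阈值 0.3 为假设值,实际数值应遵循 R-CNN[9] 的设置):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """boxes: (n, 4) 的 (x1, y1, x2, y2);scores: (n,)。
    按得分从高到低贪心保留,与已保留框 IoU 超过阈值的框被抑制。"""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1 + 1) * np.maximum(0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```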


3.1 Truncated SVD for faster detection 使用截断的 SVD 实现更快的检测


  For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].


在整图分类任务中,与卷积层相比,计算全连接层所花费的时间较少。相反,在检测任务中,要处理的 RoI 数量很大,接近一半的前向传播时间花在计算全连接层上(参见图 2)。较大的全连接层可以很容易地通过截断的 SVD[5, 23] 进行压缩来加速。

[图 2:截断 SVD 前后 VGG16 的测试耗时分布]


In this technique, a layer parameterized by the $u \times v$ weight matrix $W$ is approximately factorized as


$$W \approx U \Sigma_t V^T$$


using SVD. In this factorization, $U$ is a $u \times t$ matrix comprising the first $t$ left-singular vectors of $W$, $\Sigma_t$ is a $t \times t$ diagonal matrix containing the top $t$ singular values of $W$, and $V$ is a $v \times t$ matrix comprising the first $t$ right-singular vectors of $W$. Truncated SVD reduces the parameter count from $uv$ to $t(u+v)$, which can be significant if $t$ is much smaller than $\min(u, v)$. To compress a network, the single fully connected layer corresponding to $W$ is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^T$ (and no biases) and the second uses $U$ (with the original biases associated with $W$). This simple compression method gives good speedups when the number of RoIs is large.


在这种技术中,由 $u \times v$ 权重矩阵 $W$ 参数化的层通过 SVD 被近似分解为:


$$W \approx U \Sigma_t V^T$$


在这种分解中,$U$ 是一个 $u \times t$ 的矩阵,由 $W$ 的前 $t$ 个左奇异向量组成;$\Sigma_t$ 是 $t \times t$ 对角矩阵,包含 $W$ 的前 $t$ 个奇异值;$V$ 是 $v \times t$ 矩阵,由 $W$ 的前 $t$ 个右奇异向量组成。截断 SVD 将参数量从 $uv$ 减少到 $t(u+v)$,如果 $t$ 远小于 $\min(u, v)$,则压缩效果非常显著。为了压缩网络,对应于 $W$ 的单个全连接层被两个全连接层替代,它们之间没有非线性。这两层中的第一层使用权重矩阵 $\Sigma_t V^T$(没有偏置),第二层使用 $U$(带有与 $W$ 相关联的原始偏置)。当 RoI 的数量较大时,这种简单的压缩方法能实现很好的加速。
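用 NumPy 对一个全连接层做截断 SVD 压缩的草图如下(假设该层计算 $y = Wx + b$,其中 $W$ 的形状为 $u \times v$):

```python
import numpy as np

def compress_fc_with_truncated_svd(W, b, t):
    """把单个全连接层 (W, b) 替换为两个无非线性的全连接层:
    第一层权重为 Σ_t V^T(无偏置),第二层权重为 U(带原始偏置 b)。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_t = U[:, :t]                    # u × t:前 t 个左奇异向量
    SVt = S[:t, None] * Vt[:t]        # t × v:即 Σ_t V^T
    # 压缩后的前向计算:y ≈ U_t @ (SVt @ x) + b
    return SVt, np.zeros(t), U_t, b
```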


4. Main results 主要结果


 Three main results support this paper’s contributions: 1. State-of-the-art mAP on VOC07, 2010, and 2012 2. Fast training and testing compared to R-CNN, SPPnet 3. Fine-tuning conv layers in VGG16 improves mAP


三个主要结果支持本文的贡献: 1. VOC07,2010 和 2012 的最高的 mAP。 2. 相比 R-CNN,SPPnet,训练和测试的速度更快。 3. 对 VGG16 卷积层 Fine-tuning 后提升了 mAP。

4.1 Experimental setup 实验设置


 Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has the same depth as S, but is wider. We call this network model M, for “medium.” The final network is the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this section, all experiments use singlescale training and testing (s = 600; see Section 5.2 for details).


我们的实验使用了三个经过预训练的 ImageNet 网络模型,这些 模型可以在线获得(脚注:https://github.com/BVLC/caffe/wiki/ModelZoo)。第一个是来自 R-CNN [9]的 CaffeNet(实质上是 AlexNet[14])。 我们将这个 CaffeNet 称为模型 S,即小模型。第二网络是来自[3]的 VGG_CNN_M_1024,其具有与 S 相同的深度,但是更宽。我们把这 个网络模型称为 M,即中等模型。最后一个网络是来自[20]的非常深 的 VGG16 模型。由于这个模型是最大的,我们称之为 L。在本节中, 所有实验都使用单尺度训练和测试(s=600,详见 5.2 节)。

4.2 VOC 2010 and 2012 results VOC 2010 和 2012 数据集上的结果


 On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4 (outside data) track from the public leaderboard (Table 2, Table 3). For the NUS_NIN_c2000 and BabyLearning methods, there are no associated publications at this time and we could not find exact information on the ConvNet architectures used; they are variants of the Network-in-Network design [17]. All other methods are initialized from the same pre-trained VGG16 network.

在这些数据集上,我们将 Fast R-CNN(简称 FRCN)与公共排行榜中 comp4(外部数据)赛道上的主流方法进行比较(见下面表 2、表 3)(脚注:http://host.robots.ox.ac.uk:8080/leaderboard)。对于 NUS_NIN_c2000 和 BabyLearning 方法,目前没有相关的出版物,我们无法找到其所用 ConvNet 体系结构的确切信息;它们是 Network-in-Network[17] 的变体。所有其他方法都从相同的预训练 VGG16 网络初始化。


[表 2:VOC 2010 测试集上的检测结果(mAP)]

[表 3:VOC 2012 测试集上的检测结果(mAP)]


 Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1] semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.


Fast R-CNN 在 VOC12 上获得了最高结果,mAP 为 65.7%(使用额外数据时为 68.4%)。它也比其他方法快两个数量级,那些方法都基于较"慢"的 R-CNN 流程。在 VOC10 上,SegDeepM[25] 获得了比 Fast R-CNN 更高的 mAP(67.2% 对 66.1%)。SegDeepM 使用 VOC12 trainval 训练集及分割标注进行训练,其设计目的是通过马尔可夫随机场,对 R-CNN 的检测结果和来自 O2P[1] 语义分割方法的分割结果进行联合推理,从而提高 R-CNN 的精度。可以把 SegDeepM 中的 R-CNN 替换为 Fast R-CNN,这可能会带来更好的结果。当使用扩大的 07++12 训练集(见表 2 标题)时,Fast R-CNN 的 mAP 增加到 68.8%,超过了 SegDeepM。


4.3 VOC 2007 results VOC 2007 数据集上的结果


 On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). RCNN achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult” in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use “difficult” examples.


在 VOC07 数据集上,我们比较 Fast R-CNN 与 R-CNN 和 SPPnet。所有方法都从相同的预训练 VGG16 网络开始,并使用 bounding-box 回归。VGG16 SPPnet 的结果由论文[11]的作者计算得到。SPPnet 在训练和测试期间使用五个尺度。Fast R-CNN 相对 SPPnet 的提升说明,即使 Fast R-CNN 使用单尺度训练和测试,对卷积层的 fine-tuning 也为 mAP 带来了很大改进(从 63.1% 到 66.9%)。R-CNN 的 mAP 为 66.0%。一个小细节是,SPPnet 在训练时没有使用 PASCAL 中标记为"困难"的样本。若同样去掉这些样本,Fast R-CNN 的 mAP 提高到 68.1%。其他所有实验都使用了标记为"困难"的样本。


4.4 Training and testing time 训练和测试时间


  Fast training and testing times are our second main result. Table 4 compares training time (hours), testing rate (seconds per image), and mAP on VOC07 between Fast RCNN, R-CNN, and SPPnet. For VGG16, Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it. Training time is reduced by 9×, from 84 hours to 9.5. Compared to SPPnet, Fast RCNN trains VGG16 2.7× faster (in 9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it. Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.


快速的训练和测试是我们的第二个主要成果。表 4 比较了 Fast R-CNN、R-CNN 和 SPPnet 之间的训练时间(小时)、测试速率(秒/图像)以及在 VOC07 上的 mAP。对于 VGG16,不使用截断 SVD 时,Fast R-CNN 处理图像比 R-CNN 快 146 倍;使用截断 SVD 时快 213 倍。训练时间减少了 9 倍,从 84 小时减少到 9.5 小时。与 SPPnet 相比,Fast R-CNN 训练 VGG16 快 2.7 倍(9.5 小时对 25.5 小时);不使用截断 SVD 时测试快 7 倍,使用截断 SVD 时快 10 倍。Fast R-CNN 还省去了数百 GB 的磁盘存储,因为它不缓存特征。


[表 4:Fast R-CNN、R-CNN 与 SPPnet 的训练时间、测试速率及 VOC07 mAP 对比]


 Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088×4096 matrix in VGG16’s fc6 layer and the top 256 singular values from the 4096 ×4096 fc7 layer reduces runtime with little loss in mAP. Further speedups are possible with smaller drops in mAP if one fine-tunes again after compression.


截断的 SVD。截断的 SVD 可以将检测时间减少 30%以上,同时能保 持 mAP 只有很小(0.3 个百分点)的下降,并且无需在模型压缩后执 行额外的 fine-tuning。图 2 显示了如何使用来自 VGG16 的 fc6 层中 的 25088×4096 矩阵的顶部 1024 个奇异值和来自 fc7 层的 4096×4096 矩阵的顶部 256 个奇异值减少运行时间,而 mAP 几乎没有损失。如 果在压缩之后再次微调,则可以在 mAP 更小下降的情况下进一步提 升速度。


4.5. Which layers to fine-tune? fine-tune 哪些层?


 For the less deep networks considered in the SPPnet paper [11], finetuning only the fully connected layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4% (Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is important for very deep nets.


 对于 SPPnet 论文[11]中提到的不太深的网络,仅 fine-tuning 全连 接层似乎足以获得良好的准确度。我们假设这个结果不适用于非常深 的网络。为了验证 fine-tuning 卷积层对于 VGG16 的重要性,我们使 用 Fast R-CNN 进行 fine-tuning,但冻结十三个卷积层,以便只有全连 接层学习。这种消融模拟了单尺度 SPPnet 训练,将 mAP 从 66.9%降 低到 61.4%(如表 5 所示)。这个实验验证了我们的假设:通过 RoI 池化层的训练对于非常深的网是重要的。


[表 5:限制 fine-tune 的层对 VGG16 检测精度的影响]


 Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3 × (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.


 这是否意味着所有卷积层应该进行 fine-tune?简而言之,不是的。 在较小的网络(S 和 M)中,我们发现 conv1是通用的、不依赖于特定任务的(一个众所周知的事实[14])。允 许 conv1 学习或不学习,对 mAP 没有很关键的影响。对于 VGG16, 我们发现只需要更新 conv3_1 及以上(13 个卷积层中的 9 个)的层。 这个观察结果是实用的:(1)与从 conv3_1 更新相比,从 conv2_1 更 新使训练变慢 1.3 倍(12.5 小时对比 9.5 小时),(2)从 conv1_1 更新 时 GPU 内存不够用。从 conv2_1 学习时 mAP 仅增加 0.3 个点(如表 5 最后一列所示)。本文中所有 Fast R-CNN 的结果都 fine-tune VGG16 conv3_1 及以上的层,所有用模型 S 和 M 的实验 fine-tune conv2 及以 上的层。


5. Design evaluation 设计评估


 We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to evaluate design decisions. Following best practices, we performed these experiments on the PASCAL VOC07 dataset.


 我们通过实验来了解 Fast RCNN 与 R-CNN 和 SPPnet 的比较, 以及评估设计决策。按照最佳实践,我们在 PASCAL VOC07 数据集 上进行了这些实验。


5.1 Does multi-task training help? 多任务训练有用吗?


 Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained tasks. But it also has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection accuracy in Fast R-CNN?


 多任务训练是方便的,因为它避免管理顺序训练任务的 pipeline。 但它也有可能改善结果,因为任务通过共享的表示(ConvNet)[2]相 互影响。多任务训练能提高 Fast R-CNN 中的目标检测精度吗?

 To test this question, we train baseline networks that use only the classification loss, Lcls, in Eq. 1 (i.e., setting λ= 0). These baselines are printed for models S, M, and L in the first column of each group in Table 6. Note that these models do not have bounding-box regressors. Next (second column per group), we take networks that were trained with the multi-task loss (Eq. 1, λ=1), but we disable bounding-box regression at test time. This isolates the networks’ classification accuracy and allows an apples-to-apples comparison with the baseline networks.


为了测试这个问题,我们训练仅使用公式(1)中分类损失 Lcls 的基准网络(即设置 λ=0)。这些 baseline 模型 S、M、L 的结果列在表 6 中每组的第一列。请注意,这些模型没有 bounding-box 回归器。接下来(每组的第二列),是我们采用多任务损失(公式(1),λ=1)训练的网络,但在测试时禁用 bounding-box 回归。这样可以单独考察网络的分类准确性,并与基准网络进行同类比较(译者注:apples-to-apples comparison 意为在同等条件下比较同类事物)。


[表 6:多任务训练消融实验(模型 S、M、L)]


 Across all three networks we observe that multi-task training improves pure classification accuracy relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points, showing a consistent positive effect from multi-task learning.


在所有三个网络中,我们观察到多任务训练相对于单独的分类训 练提高了纯分类准确度。改进范围从+0.8 到+1.1 个 mAP 点,显示了 多任务学习的一致的积极效果。


Finally, we take the baseline models (trained with only the classification loss), tack on the bounding-box regression layer, and train them with Lloc while keeping all other network parameters frozen. The third column in each group shows the results of this stage-wise training scheme: mAP improves over column one, but stage-wise training underperforms multi-task training (fourth column per group).


 最后,我们采用 baseline 模型(仅使用分类损失进行训练),加上 bounding-box 回归层,并使用 Lloc 训练它们,同时保持所有其他网络 参数冻结。每组中的第三列显示了这种逐级训练方案的结果:mAP 相 对于第一列有改进,但逐级训练表现不如多任务训练(每组第四列)。


5.2 Scale invariance: to brute force or finesse? 尺度不变性:暴力或精细?


 We compare two strategies for achieving scale-invariant object detection: brute-force learning (single scale) and image pyramids (multiscale). In either case, we define the scale s of an image to be the length of its shortest side.

我们比较两个策略实现尺度不变物体检测:暴力学习(单尺度) 和图像金字塔(多尺度)。在任一情况下,我们将尺度 s 定义为图像 短边的长度。

 All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.


 所有单尺度实验使用 s=600 像素,对于一些图像,s 可以小于 600, 因为我们保持横纵比缩放图像,并限制其最长边为 1000 像素。选择 这些值使得 VGG16 在 fine-tune 期间不至于 GPU 内存不足。较小的 模型占用显存更少,所以可受益于较大的 s 值。然而,每个模型的优 化不是我们的主要的关注点。我们注意到 PASCAL 图像平均大小是 384×473 像素的,因此单尺度设置通常以 1.6 的倍数对图像进行上采 样。因此,RoI 池化层的平均有效步长约为 10 像素。
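单尺度设置中"短边缩放到 s=600、长边不超过 1000 且保持横纵比"的缩放系数,可以用下面的小函数示意:

```python
def compute_rescale_factor(height, width, target_size=600, max_size=1000):
    """返回把短边缩放到 target_size、同时保证长边不超过 max_size 的缩放系数。"""
    short_side = min(height, width)
    long_side = max(height, width)
    scale = target_size / short_side
    if long_side * scale > max_size:   # 长边超限时改按长边上限缩放,此时短边会小于 600
        scale = max_size / long_side
    return scale
```

例如,对平均约为 384×473 像素的 PASCAL 图像,该系数约为 600/384≈1.56,即正文所说的约 1.6 倍上采样。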

 In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480, 576, 688, 864, 1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid exceeding GPU memory.

在多尺度模型配置中,我们使用[11]中指定的相同的五个尺度(s ∈{480,576,688,864,1200}),以方便与 SPPnet 进行比较。但是,我们 限制长边最大为 2000 像素,以避免 GPU 内存不足。


 Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most surprising result in [11] was that singlescale detection performs almost as well as multi-scale detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0% reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is warped to a canonical size.


表 7 显示了模型 S 和 M 在使用一个或五个尺度进行训练和测试时的结果。也许[11]中最令人惊讶的结果是单尺度检测的表现几乎与多尺度检测一样好。我们的研究结果证实了他们的结论:深度卷积网络擅长直接学习尺度不变性。多尺度方法以大量计算时间为代价,仅带来了很小的 mAP 提升(表 7)。对于 VGG16(模型 L),由于实现细节的限制,我们只能使用单尺度。然而它仍得到了 66.9% 的 mAP,略高于 R-CNN[10] 报告的 66.0%,尽管 R-CNN 在"每个候选区域都被缩放到规范尺寸"这个意义上使用了"无限"多的尺度。


[表 7:单尺度与多尺度训练/测试对比(模型 S 和 M)]


 Since single-scale processing offers the best tradeoff between speed and accuracy, especially for very deep models, all experiments outside of this sub-section use single-scale training and testing with s = 600 pixels.


 由于单尺度处理能够权衡好速度和精度之间的关系,特别是对于 非常深的模型,本小节以外的所有实验使用单尺度 s=600 像素的尺度 进行训练和测试。


5.3 Do we need more training data? 我们需要更多训练数据吗?


 A good object detector should improve when supplied with more training data. Zhu et al. [24] found that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k, to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0% (Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.

当提供更多的训练数据时,好的目标检测器的性能应该会进一步提升。Zhu 等人[24]发现 DPM[8] 的 mAP 在只有几百到几千个训练样本时就已饱和。这里我们用 VOC12 trainval 训练集扩充 VOC07 trainval 训练集,使图像数量大约增加到原来的三倍,达到 16.5k,以此评估 Fast R-CNN。扩大训练集将 VOC07 测试集上的 mAP 从 66.9% 提高到 70.0%(表 1)。在这个数据集上训练时,我们使用 60k 次小批量迭代而不是 40k 次。

[表 1:VOC 2007 测试集上的检测结果(mAP)]


 We perform similar experiments for VOC10 and 2012, for which we construct a dataset of 21.5k images from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use 100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k). For VOC10 and 2012, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.


我们对 VOC10 和 2012 进行了类似的实验,为此我们用 VOC07 trainval、VOC07 test 和 VOC12 trainval 的并集构造了一个含 21.5k 张图像的数据集。在这个数据集上训练时,我们使用 100k 次 SGD 迭代,并且每 40k 次迭代(而不是每 30k 次)将学习率降低为原来的 0.1 倍。对于 VOC10 和 2012,mAP 分别从 66.1% 提高到 68.8% 和从 65.7% 提高到 68.4%。


5.4 Do SVMs outperform softmax? SVM 分类是否优于 Softmax?


 Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest linear SVMs post-hoc, as was done in RCNN and SPPnet. To understand the impact of this choice, we implemented post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and hyper-parameters as in R-CNN.


Fast R-CNN 使用在 fine-tuning 期间学习的 softmax 分类器,而不是像 R-CNN 和 SPPnet 那样,在事后再训练一对多(one-vs-rest)的线性 SVM。为了理解这种选择的影响,我们在 Fast R-CNN 中实现了带有难负样本挖掘的事后 SVM 训练。我们使用与 R-CNN 中相同的训练算法和超参数。


 Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points. This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI.


如表 8 所示,对于所有三个网络,Softmax 都略优于 SVM,mAP 提高了 0.1 到 0.8 个点不等。这个提升效果很小,但它表明,与先前的多级训练方法相比,"一次性"的 fine-tuning 已经足够。我们注意到,与一对多的 SVM 不同,Softmax 在为 RoI 打分时引入了类别之间的竞争。


[表 8:使用 Softmax 与 SVM 的 Fast R-CNN 在 VOC07 上的 mAP 对比]


5.5 Are more proposals always better? 更多的候选区域更好吗?


 There are (broadly) two types of object detectors: those that use a sparse set of object proposals (e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of candidates leaving the classifier with a small set to evaluate. This cascade improves detection accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade also improves Fast R-CNN accuracy.


(广义上)存在两种类型的目标检测器:一类使用稀疏的候选区域集合(例如 selective search[21]),另一类使用密集集合(例如 DPM[8])。对稀疏候选区域进行分类是一种级联方式[22]:候选机制首先舍弃大量候选,只留下一个较小的集合交给分类器评估。当应用于 DPM 检测时,这种级联方式提高了检测精度[21]。我们发现这种"候选-分类器"级联方式同样提高了 Fast R-CNN 的精度。


 Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time re-training and re-testing model M. If proposals serve a purely computational role, increasing the number of proposals per image should not harm mAP.


 使用 selective search 的质量模式,我们对每个图像扫描 1k 到 10k 个候选框,每次重新训练和重新测试模型 M。如果候选框纯粹扮演计 算的角色,增加每个图像的候选框数量不会影响 mAP。

 We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3, solid blue line). This experiment shows that swamping the deep classifier with more proposals does not help, and even slightly hurts, accuracy.


我们发现,随着候选区域数量的增加,mAP 先上升然后轻微下降(如图 3 蓝色实线所示)。这个实验表明,用更多的候选区域"淹没"深度分类器并没有帮助,甚至会轻微损害准确性。


 This result is difficult to predict without actually running the experiment. The state-of-the-art for measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows that AR (solid red line) does not correlate well with mAP as the number of proposals per image is varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.


如果不实际进行实验,这个结果很难预测。目前评估候选区域质量最先进的指标是平均召回率(Average Recall, AR)[12]。当每张图像使用固定数量的候选区域时,对于多种基于 R-CNN 的候选区域方法,AR 与 mAP 有良好的相关性。图 3 表明,当每张图像的候选区域数量变化时,AR(红色实线)与 mAP 的相关性并不好。AR 必须谨慎使用:更多候选区域带来的更高 AR 并不意味着 mAP 也会增加。幸运的是,使用模型 M 进行训练和测试只需不到 2.5 小时。因此,Fast R-CNN 能够高效、直接地评估候选区域对 mAP 的影响,这比使用代理指标更可取。


[图 3:不同候选框数量与类型下 VOC07 的 mAP 与平均召回率(AR)]


 We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3, blue triangle).


 我们还研究了当使用密集生成框(在不同缩放尺度、位置和宽高 比上)大约 45k 个框/图像比例时的 Fast R-CNN 网络模型。这个密集 集足够大,当每个 selective search 框被其最近(IoU)密集框替换时, mAP 只降低 1 个点(到 57.7%,如图 3 蓝色三角形所示)。

 The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k selective search boxes, we test mAP when adding a random sample of 1000×{2,4,6,8,10,32,45} dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added, mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.

密集框的统计特性与 selective search 框不同。从 2k 个 selective search 框开始,我们每次添加 1000×{2,4,6,8,10,32,45} 个随机采样的密集框,并测试 mAP。对于每个实验,我们都重新训练和重新测试模型 M。当添加这些密集框时,mAP 比添加更多 selective search 框时下降得更明显,最终降到 53.0%。


 We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of 52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the dense box distribution. SVMs do even worse: 49.3% (blue circle).

我们还只使用密集框(45k/图像)来训练和测试 Fast R-CNN。此设置的 mAP 为 52.9%(蓝色菱形)。最后,我们检验是否需要带有难负样本挖掘的 SVM 来处理密集框分布。SVM 的结果更糟糕:49.3%(蓝色圆圈)。


5.6 Preliminary MS COCO results MS COCO 初步结果


 We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline. We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also averages over IoU thresholds, is 19.7%.


我们将 Fast R-CNN(使用 VGG16)应用于 MS COCO 数据集[18],以建立初步的 baseline。我们在 80k 张图像的训练集上进行了 240k 次迭代训练,并使用评估服务器在"test-dev"数据集上进行评估。PASCAL 风格的 mAP 为 35.9%;新的 COCO 风格的 AP(同时对多个 IoU 阈值取平均)为 19.7%。


6. Conclusion 结论


 This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting state-of-the-art detection results, we present detailed experiments that we hope provide new insights. Of particular note, sparse object proposals appear to improve detector quality. This issue was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course, there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse proposals. Such methods, if developed, may help further accelerate object detection.


 本文提出 Fast R-CNN,一个对 R-CNN 和 SPPnet 更新的简洁、 快速版本。除了报告目前最先进的检测结果之外,我们还提供了详细 的实验,希望提供新的思路。特别值得注意的是,稀疏目标候选区域 似乎提高了检测器的质量。过去这个问题代价太大(在时间上)而一 直无法深入探索,但 Fast R-CNN 使其变得可能。当然,可能存在未 发现的技术,使得密集框能够达到与稀疏候选框类似的效果。如果这 样的方法被开发出来,则可以帮助进一步加速目标检测。
