\documentclass[UTF8]{ctexart}
\usepackage[colorlinks=true]{hyperref}
\usepackage{amsmath, bm,amsfonts}
\usepackage{hyperref}
\usepackage[normalem]{ulem}
% \usepackage{enumitem}
% \setlist{nosep}
\usepackage{caption}
\usepackage{graphicx}
% \graphicspath{{./pic/}}
\usepackage[usenames, dvipsnames]{xcolor}
\usepackage{listings}
\input{setting_list}
\usepackage{hologo}
\usepackage{subfigure}
\usepackage{changepage}
\ctexset{
section = {
titleformat = \raggedright,
name = {第,节},
number = \chinese{section}
}
}
\title{mmdet简略解析}
\author{Sisyphes}
\begin{document}
\maketitle
\tableofcontents
\newpage
\section{一些废话}
拖拖拉拉,勉强走完,才发现,这才是开始。一眼望去,很多地方都比较粗糙,那些需要用实验丰富的地方,时间肯定是被偷了。
虽然不排除存在对初学者有帮助的地方,但真正有帮助的或许是mmsdet(目前不存在),一个简单,轻量,高性能的检测库。
\section{结构设计}
19年7月,Kai Chen等人写了一篇文章
\href{https://arxiv.org/pdf/1906.07155.pdf}{MMDetection},介绍了他们在
\href{https://github.com/open-mmlab/mmdetection}{mmdetection}上的一些工作。
包括mmdetection的设计逻辑,已实现的算法等。
猜:Kai Chen在不知道经历了一些什么之后,觉得对各种实现迥异的检测算法抽象一些公共的组件出来也许是一件不错的事。这里尝试对代码做一些简单的解析,见下。
\\
\noindent 组件设计:
\begin{itemize}
\item BackBone:特征提取骨架网络,ResNet,ResneXt,ssd\_vgg, hrnet等。
\item Neck: 连接骨架和头部,多层级特征融合,FPN,BFP,PAFPN等。
\item DenseHead:处理特征图上的密集框部分,主要分AnchorHead、AnchorFreeHead两大类,分别有RPNHead,SSDHead,RetinaHead和FCOSHead等。
\item RoIExtractor:对特征图上的预选框做pool得到大小统一的roi。
\item RoIHead (BBoxHead/MaskHead):在特征图上对roi做类别分类或位置回归等(1.x)。
\item ROIHead:bbox或mask的roi\_extractor+head(2.0,合并了extractor和head)
\item OneStage: BackBone + Neck + DenseHead
\item TwoStage: BackBone + Neck + DenseHead + RoIExtractor + RoIHead : 1.x
\item TwoStage: BackBone + Neck + DenseHead + RoIHead(2.0)
\end{itemize}
\noindent 代码结构:
\begin{adjustwidth}{0.5cm}{0cm}
configs 网络组件结构等配置信息\\
tools:训练和测试的最终包装和一些实用脚本\\
mmdet:
\begin{adjustwidth}{0.5cm}{0cm}
apis: 分布式环境设定(1.x,2.0移植到mmcv),推断,测试,训练基础代码\\
core: anchor生成,bbox,mask编解码,变换,标签锚定,采样等,模型评估,加速,优化器,后处理等\\
datasets:coco,voc等数据类,数据pipelines的统一格式,数据增强,数据采样\\
models:模型组件(backbone,head,loss,neck),采用注册和组合构建的形式完成模型搭建\\
ops:优化加速代码,包括nms,roialign,dcn,masked\_conv,focal\_loss等\\
\end{adjustwidth}
\end{adjustwidth}
\begin{figure}[htbp]
\centering
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=5cm, height=3cm]{./pic/mmdetect.png}
\caption{ Framework }
\end{minipage}
\begin{minipage}[t]{0.48\textwidth}
\centering
\includegraphics[width=5cm,height=3cm]{./pic/mmdetect_pipe.png}
\caption{Training pipeline}
\label{trainpipe_pic}
\end{minipage}
\end{figure}
% \newpage
\subsection{总体逻辑}
从tools/train.py中能看到整体可分如下4个步骤:\\
1. mmcv.Config.fromfile从配置文件解析配置信息,并做适当更新,包括环境搜集,预加载模型文件,分布式设置,日志记录等\\
\noindent 2. mmdet.models中的build\_detector根据配置信息构造模型
\begin{adjustwidth}{0.5cm}{0cm}
2.1 build系列函数调用build\_from\_cfg函数,按type关键字从注册表中获取相应的
对象,对象的具名参数在注册文件中赋值。\\
2.2 registry.py放置了模型的组件注册器。其中注册器的register\_module成员函数是一个装饰器功能函数,在具体的类对象$A$头上装饰@X.register\_module,并同时在$A$对象所在包的初始化文件中导入$A$,即可将$A$保存到registry.module\_dict中,完成注册。\\
2.3 目前包含BACKBONES,NECKS,ROI\_EXTRACTORS,SHARED\_HEADS,HEADS,LOSSES,DETECTORS七个模型相关注册器,另外还有数据类,优化器等注册器。\\
\end{adjustwidth}
\noindent 3. build\_dataset根据配置信息获取数据类
\begin{adjustwidth}{0.5cm}{0cm}
3.1 coco,cityscapes,voc,wider\_face等数据(数据类扩展见后续例子)。\\
\end{adjustwidth}
\noindent 4.train\_detector模型训练流程
\begin{adjustwidth}{0.5cm}{0cm}
4.1 数据loader化,模型分布式化,优化器选取\\
4.2 进入runner训练流程(来自mmcv库,采用hook方式,整合了pytorch训练流程)\\
4.3 训练pipelines可见\ref{trainpipe_pic},具体细节见后续展开。\\
\end{adjustwidth}
\noindent 后续说说配置文件,注册机制和训练逻辑。
\section{配置,注册}
\subsection{配置类}
配置方式支持python/json/yaml,由mmcv的Config解析,其功能与maskrcnn-benchmark的yacs类似,将字典的取值方式属性化。这里贴部分代码,以供学习。
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
class Config(object):
...
@staticmethod
def _file2dict(filename):
filename = osp.abspath(osp.expanduser(filename))
check_file_exist(filename)
if filename.endswith('.py'):
with tempfile.TemporaryDirectory() as temp_config_dir:
shutil.copyfile(filename,
osp.join(temp_config_dir, '_tempconfig.py'))
sys.path.insert(0, temp_config_dir)
mod = import_module('_tempconfig')
sys.path.pop(0)
cfg_dict = {
name: value
for name, value in mod.__dict__.items()
if not name.startswith('__')
}
# delete imported module
del sys.modules['_tempconfig']
elif filename.endswith(('.yml', '.yaml', '.json')):
import mmcv
cfg_dict = mmcv.load(filename)
else:
raise IOError('Only py/yml/yaml/json type are supported now!')
cfg_text = filename + '\n'
with open(filename, 'r') as f:
cfg_text += f.read()
# 2.0新增的配置文件的组合继承
if '_base_' in cfg_dict:
cfg_dir = osp.dirname(filename)
base_filename = cfg_dict.pop('_base_')
base_filename = base_filename if isinstance(
base_filename, list) else [base_filename]
cfg_dict_list = list()
cfg_text_list = list()
for f in base_filename:
# 递归,可搜索staticmethod and recursion
# 静态方法调静态方法,类方法调静态方法
_cfg_dict, _cfg_text = Config._file2dict(osp.join(cfg_dir, f))
cfg_dict_list.append(_cfg_dict)
cfg_text_list.append(_cfg_text)
base_cfg_dict = dict()
for c in cfg_dict_list:
if len(base_cfg_dict.keys() & c.keys()) > 0:
raise KeyError('Duplicate key is not allowed among bases')
base_cfg_dict.update(c)
# 合并
Config._merge_a_into_b(cfg_dict, base_cfg_dict)
cfg_dict = base_cfg_dict
# merge cfg_text
cfg_text_list.append(cfg_text)
cfg_text = '\n'.join(cfg_text_list)
return cfg_dict, cfg_text
...
# 获取key值
def __getattr__(self, name):
return getattr(self._cfg_dict, name)
# 序列化
def __getitem__(self, name):
return self._cfg_dict.__getitem__(name)
# 将字典属性化主要用了__setattr__
def __setattr__(self, name, value):
if isinstance(value, dict):
value = ConfigDict(value)
self._cfg_dict.__setattr__(name, value)
# 更新key值
def __setitem__(self, name, value):
if isinstance(value, dict):
value = ConfigDict(value)
self._cfg_dict.__setitem__(name, value)
# 迭代器
def __iter__(self):
return iter(self._cfg_dict)
\end{lstlisting}
主要考虑点是自己怎么实现类似的东西,核心点就是python的基本魔法函数的应用,可同时参考yacs。
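As a minimal sketch (not mmcv's actual implementation; the class name AttrDict is made up), a dict can be made attribute-accessible with just a couple of magic methods:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
# Minimal sketch, not mmcv's code: a dict whose keys can also be
# read and written as attributes, similar in spirit to ConfigDict.
class AttrDict(dict):
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value

cfg = AttrDict(lr=0.02, optimizer=dict(type='SGD'))
assert cfg.lr == cfg['lr']   # attribute access equals key access
cfg.total_epochs = 12        # attribute writes update the dict
\end{lstlisting}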
\subsection{注册器}
把基本对象放到一个继承了字典的对象中,实现了对象的灵活管理。
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import inspect
from functools import partial
import mmcv
class Registry(object):
# 2.0 放到mmcv中
def __init__(self, name):
self._name = name
self._module_dict = dict()
@property
def name(self):
return self._name
@property
def module_dict(self):
return self._module_dict
def get(self, key):
return self._module_dict.get(key, None)
def _register_module(self, module_class, force=False):
"""Register a module.
Args:
module (:obj:`nn.Module`): Module to be registered.
"""
if not inspect.isclass(module_class):
raise TypeError('module must be a class, but got {}'.format(
type(module_class)))
module_name = module_class.__name__
if not force and module_name in self._module_dict:
raise KeyError('{} is already registered in {}'.format(
module_name, self.name))
self._module_dict[module_name] = module_class # 类名:类
def register_module(self, cls=None, force=False):
# 作为类cls的装饰器
if cls is None:
# partial函数(类)固定参数,返回新对象,递归不是很清楚
return partial(self.register_module, force=force)
self._register_module(cls, force=force) # 将cls装进当前Registry对象的中_module_dict
return cls # 返回类
def build_from_cfg(cfg, registry, default_args=None):
assert isinstance(cfg, dict) and 'type' in cfg
assert isinstance(default_args, dict) or default_args is None
args = cfg.copy()
obj_type = args.pop('type')
if mmcv.is_str(obj_type):
# 从注册类中拿出obj_type类
obj_cls = registry.get(obj_type)
if obj_cls is None:
raise KeyError('{} is not in the {} registry'.format(
obj_type, registry.name))
elif inspect.isclass(obj_type):
obj_cls = obj_type
else:
raise TypeError('type must be a str or valid type, but got {}'.format(
type(obj_type)))
if default_args is not None:
# 增加一些新的参数
for name, value in default_args.items():
args.setdefault(name, value)
return obj_cls(**args) # **args是将字典解析成位置参数(k=v)。
\end{lstlisting}
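A small usage sketch tying register\_module and build\_from\_cfg together (the registry name MODELS and the class ToyHead are made up for illustration):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
MODELS = Registry('models')

@MODELS.register_module()
class ToyHead(object):
    # Illustrative component, not a real mmdet head.
    def __init__(self, num_classes, in_channels=256):
        self.num_classes = num_classes
        self.in_channels = in_channels

cfg = dict(type='ToyHead', num_classes=80)
head = build_from_cfg(cfg, MODELS, default_args=dict(in_channels=128))
# 'type' selects the class from MODELS._module_dict; the remaining keys
# plus default_args become the constructor's keyword arguments.
\end{lstlisting}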
\section{数据处理}
\label{sec:detail}
数据处理可能是炼丹师接触最为密集的了,因为通常情况,除了数据的离线处理,写个数据类,就可以炼丹了。本节主要涉及数据的在线处理,
更进一步应该是检测分割数据的pytorch处理方式。虽然mmdet将常用的数据都实现了,而且也实现了中间通用数据格式,但,这和模型,损失函数,
性能评估的实现也相关,比如你想把官网的centernet完整的改成mmdet风格,就能看到(看起来没必要)。
Data processing is probably what practitioners touch most often: usually, apart from offline preprocessing, writing a dataset
class is all it takes before training can start. This section focuses on the online processing of data, or more precisely the
PyTorch-style handling of detection and segmentation data. Although mmdet already implements the commonly used datasets, together
with an intermediate unified data format, this format is coupled with the implementation of the models, loss functions and
evaluation. For example, if you try to fully convert the official CenterNet into mmdet style, you will see how many changes are
required (and that it is probably unnecessary).
\subsection{检测分割数据}
看看配置文件,数据相关的有data dict,里面包含了train,val,test的路径信息,用于数据类初始化,有pipeline,
将各个函数及对应参数以字典形式放到列表里,是对pytorch原装的transforms+compose,在检测,分割相关数据上的一次封装,使得形式更加统一。
Looking at the configuration file, the data-related part is the data dict, which contains the path information of train, val and
test used to initialize the dataset classes, and the pipeline, which puts each transform and its parameters into a list as
dictionaries. It is a wrapper around PyTorch's original transforms + Compose, adapted to detection and segmentation data, which
makes the format more uniform.
从builder.py中build\_dataset函数能看到,构建数据有三种方式,ConcatDataset,RepeatDataset和从注册器中提取。
其中dataset\_wrappers.py中ConcatDataset和RepeatDataset意义自明,前者继承自pytorch原始的ConcatDataset,
将多个数据集整合到一起,具体为把不同序列(可参考\href{https://docs.python.org/zh-cn/3/library/collections.abc.html}
{容器的抽象基类})的长度相加,\_\_getitem\_\_函数对应index替换一下。
后者就是单个数据类(序列)的多次重复。就功能来说,前者提高数据丰富度,后者可解决数据太少使得loading时间长的问题(见代码注释)。
而被注册的数据类在datasets下一些熟知的数据名文件中。其中,基类为custom.py中的CustomDataset,coco继承自它,
cityscapes继承自coco,xml\_style的XMLDataset继承CustomDataset,然后wider\_face,voc均继承自XMLDataset。
因此这里先分析一下CustomDataset。
From the build\_dataset function in builder.py you can see that there are three ways to build a dataset: ConcatDataset,
RepeatDataset, and retrieval from the registry. The meaning of ConcatDataset and RepeatDataset in dataset\_wrappers.py is
self-evident: the former inherits from PyTorch's original ConcatDataset and merges several datasets into one, concretely by
summing the lengths of the different sequences (see the
\href{https://docs.python.org/zh-cn/3/library/collections.abc.html}{abstract base classes for containers}) and remapping the
index in \_\_getitem\_\_; the latter simply repeats a single dataset (sequence) several times. Functionally, the former increases
data diversity, while the latter alleviates the long loading time caused by a dataset that is too small (see the code comments).
The registered dataset classes live in the familiarly named files under datasets. The base class is CustomDataset in custom.py;
coco inherits from it, cityscapes inherits from coco, XMLDataset in xml\_style inherits from CustomDataset, and wider\_face and
voc both inherit from XMLDataset. Therefore we analyze CustomDataset first.
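Before moving on to CustomDataset, here is a minimal sketch of the concat idea just described (only the length summation and index remapping; the real ConcatDataset/RepeatDataset wrappers also merge flags, CLASSES, etc., and the class name here is made up):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import bisect

class SimpleConcatDataset(object):
    # Sketch of the concat idea: lengths are summed and __getitem__
    # remaps a global index to (which dataset, local index).
    def __init__(self, datasets):
        self.datasets = datasets
        self.cum_sizes, total = [], 0
        for d in datasets:
            total += len(d)
            self.cum_sizes.append(total)

    def __len__(self):
        return self.cum_sizes[-1]

    def __getitem__(self, idx):
        ds_idx = bisect.bisect_right(self.cum_sizes, idx)
        if ds_idx > 0:
            idx -= self.cum_sizes[ds_idx - 1]
        return self.datasets[ds_idx][idx]

# e.g. SimpleConcatDataset([list(range(3)), list('abcd')])[3] == 'a'
\end{lstlisting}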
CustomDataset 记录数据路径等信息,解析标注文件,将每一张图的所有信息以字典作为数据结构存在results中,然后进入pipeline:
数据增强相关操作,代码如下:
CustomDataset records information such as the data paths, parses the annotation file, stores all the information of each image
in a dictionary called results, and then passes it through the pipeline, i.e. the data-augmentation-related operations. The code
is as follows:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
self.pipeline = Compose(pipeline)
# Compose是实现了__call__方法的类,其作用是使实例能够像函数一样被调用,同时不影响实例本身的生命周期
def pre_pipeline(self, results):
# 扩展字典信息
results['img_prefix'] = self.img_prefix
results['seg_prefix'] = self.seg_prefix
results['proposal_file'] = self.proposal_file
results['bbox_fields'] = []
results['mask_fields'] = []
results['seg_fields'] = []
def prepare_train_img(self, idx):
img_info = self.img_infos[idx]
ann_info = self.get_ann_info(idx)
# 基本信息,初始化字典
results = dict(img_info=img_info, ann_info=ann_info)
if self.proposals is not None:
results['proposals'] = self.proposals[idx]
self.pre_pipeline(results)
return self.pipeline(results) # 数据增强
def __getitem__(self, idx):
if self.test_mode:
return self.prepare_test_img(idx)
while True:
data = self.prepare_train_img(idx)
if data is None:
idx = self._rand_another(idx)
continue
return data
\end{lstlisting}
这里数据结构的选取需要注意一下,字典结构,在数据增强库albu中也是如此处理,因此可以快速替换为albu中的算法。另外每个数据类增加了
各自的evaluate函数。evaluate基础函数在mmdet.core.evaluation中,后做补充。
The choice of data structure deserves attention here: the dictionary structure is handled the same way in the augmentation
library albu (albumentations), so the transforms can be quickly swapped for the algorithms in albu. In addition, each dataset
class adds its own evaluate function; the underlying evaluation functions are in mmdet.core.evaluation and will be covered later.
mmdet的数据处理,\textbf{字典结构},\textbf{pipeline},\textbf{evaluate}是三个关键部分。其他所有类的文件解析部分,数据筛选等,看看即可。
因为我们知道,pytorch读取数据,是将序列转化为迭代器后进行io操作,所以在dataset下除了pipelines外还有loader文件夹,里面实现了分组,
分布式分组采样方法,以及调用了mmcv中的collate函数(此处为1.x版本,2.0版本将loader移植到了builder.py中),且build\_dataloader封装的
DataLoader最后在train\_detector中被调用,这部分将在后面补充,这里说说pipelines。
In mmdet's data processing, the \textbf{dictionary structure}, the \textbf{pipeline} and \textbf{evaluate} are the three key
parts; the file-parsing and data-filtering code of the individual classes only needs a quick look. Since PyTorch reads data by
turning a sequence into an iterator before doing I/O, the datasets package also contains, besides pipelines, a loader part that
implements grouping, distributed group sampling, and the call to mmcv's collate function (this is the 1.x layout; 2.0 moved the
loader into builder.py). The DataLoader wrapped by build\_dataloader is finally called in train\_detector; that part will be
covered later. Here we talk about the pipelines.
返回maskrcnn的配置文件(1.x,2.0看base config),可以看到训练和测试的不同之处:LoadAnnotations,MultiScaleFlipAug,
DefaultFormatBundle和Collect。额外提示,虽然测试没有LoadAnnotations,根据CustomDataset可知,它仍需标注文件,
这和inference的pipeline不同,也即这里的test实为evaluate。
Going back to the Mask R-CNN configuration file (1.x; for 2.0 see the base configs), the differences between training and
testing are LoadAnnotations, MultiScaleFlipAug, DefaultFormatBundle and Collect. As an extra note, although the test pipeline has
no LoadAnnotations, CustomDataset shows that it still needs the annotation file, which differs from the inference pipeline; in
other words, the test here is really evaluation.
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
# 序列中的dict可以随意删减,增加,属于数据增强调参内容
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
\end{lstlisting}
最后这些所有操作被Compose串联起来,代码如下:
Finally, all these operations are chained together by Compose; the code is as follows:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
@PIPELINES.register_module
class Compose(object):
def __init__(self, transforms):
assert isinstance(transforms, collections.abc.Sequence) # 列表是序列结构
self.transforms = []
for transform in transforms:
if isinstance(transform, dict):
transform = build_from_cfg(transform, PIPELINES)
self.transforms.append(transform)
elif callable(transform):
self.transforms.append(transform)
else:
raise TypeError('transform must be callable or a dict')
def __call__(self, data):
for t in self.transforms:
data = t(data)
if data is None:
return None
return data
\end{lstlisting}
上面代码能看到,配置文件中pipeline中的字典传入build\_from\_cfg函数,逐一实现了各个增强类(方法)。
扩展的增强类均需实现\_\_call\_\_方法,这和pytorch原始方法是一致的。
As the code above shows, each dictionary in the pipeline of the configuration file is passed to build\_from\_cfg, which
instantiates the corresponding augmentation class (method) one by one. Every extended augmentation class must implement the
\_\_call\_\_ method, which is consistent with PyTorch's original transforms.
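As an illustration of extending the pipeline, a custom transform only needs to take the results dict in, modify it, and return it, implementing \_\_call\_\_ (a sketch; the class RandomGrayscale is made up, and registration follows the same pattern as the built-in transforms):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import numpy as np

@PIPELINES.register_module
class RandomGrayscale(object):
    # Illustrative custom pipeline step: with probability `prob`,
    # replace the image by its grayscale version (dict in, dict out).
    def __init__(self, prob=0.1):
        self.prob = prob

    def __call__(self, results):
        if np.random.rand() < self.prob:
            img = results['img']
            gray = img.mean(axis=2, keepdims=True).astype(img.dtype)
            results['img'] = np.repeat(gray, 3, axis=2)
        return results
\end{lstlisting}
Adding dict(type='RandomGrayscale', prob=0.1) to train\_pipeline would then let Compose instantiate it via build\_from\_cfg like any other step.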
有了以上认识,重新梳理一下pipelines的逻辑,由三部分组成,load,transforms,和format。
load相关的LoadImageFromFile,LoadAnnotations都是字典results进去,字典results出来。具体代码看下便知,
LoadImageFromFile增加了'filename','img','img\_shape','ori\_shape','pad\_shape',
'scale\_factor','img\_norm\_cfg'字段。其中img是numpy格式。LoadAnnotations从results['ann\_info']中解析出bboxs,masks,labels等信
息。注意coco格式的原始解析来自pycocotools,包括其评估方法,这里关键是字典结构(这个和模型损失函数,评估等相关,统一结构,使得代码统一)。
transforms中的类作用于字典的values,也即数据增强。format中的DefaultFormatBundle是将数据转成mmcv扩展的容器类格式DataContainer。
另外Collect会根据不同任务的不同配置,从results中选取只含keys的信息生成新的字典,具体看下该类帮助文档。
这里看一下从numpy转成tensor的代码:
With the above understanding, the logic of the pipelines can be reorganized into three parts: load, transforms and format. The
load-related LoadImageFromFile and LoadAnnotations both take the results dictionary in and return the results dictionary out.
LoadImageFromFile adds the 'filename', 'img', 'img\_shape', 'ori\_shape', 'pad\_shape', 'scale\_factor' and 'img\_norm\_cfg'
fields, where img is a numpy array. LoadAnnotations parses bboxes, masks, labels and other information from
results['ann\_info']. Note that the original parsing of the coco format comes from pycocotools, including its evaluation method;
the key here is the dictionary structure (it is tied to the model, the loss function and the evaluation, and this unified
structure keeps the code unified). The classes in transforms act on the dictionary values, i.e. they perform the data
augmentation. DefaultFormatBundle in format converts the data into DataContainer, the container format extended by mmcv. In
addition, Collect selects from results only the information listed in keys, according to the configuration of the task, and
generates a new dictionary; see the class's docstring for details. Here is the code that converts numpy to tensor:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
def to_tensor(data):
"""Convert objects of various python types to :obj:`torch.Tensor`.
Supported types are: :class:`numpy.ndarray`, :class:`torch.Tensor`,
:class:`Sequence`, :class:`int` and :class:`float`.
"""
if isinstance(data, torch.Tensor):
return data
elif isinstance(data, np.ndarray):
return torch.from_numpy(data)
elif isinstance(data, Sequence) and not mmcv.is_str(data):
return torch.tensor(data)
elif isinstance(data, int):
return torch.LongTensor([data])
elif isinstance(data, float):
return torch.FloatTensor([data])
else:
raise TypeError('type {} cannot be converted to tensor.'.format(
type(data)))
\end{lstlisting}
以上代码告诉我们,基本数据类型需要掌握。
那么DataContainer是什么呢?它是对tensor的封装,将results中的tensor转成DataContainer格式,实际上只是增加了几个property函数,
cpu\_only,stack,padding\_value,pad\_dims,其含义自明,以及size,dim用来获取数据的维度,形状信息。
考虑到序列数据在进入DataLoader时,需要以batch方式进入模型,那么通常的collate\_fn会要求tensor数据的形状一致。但是这样不是很方便,
于是有了DataContainer。它可以做到载入GPU的数据可以保持统一shape,并被stack,也可以不stack,也可以保持原样,或者在非batch维度上做pad。
当然这个也要对default\_collate进行改造,mmcv在parallel.collate中实现了这个。
So what is DataContainer? It is a wrapper around tensors: the tensors in results are converted into DataContainer format, which
really only adds a few properties (cpu\_only, stack, padding\_value, pad\_dims, whose meanings are self-explanatory) plus size
and dim for querying the dimension and shape of the data. When sequence data goes through the DataLoader it has to be fed to the
model in batches, and the usual collate\_fn requires all tensors to have the same shape, which is not always convenient; hence
DataContainer. With it, the data loaded onto the GPU can be kept in a uniform shape and stacked, or left unstacked, or kept
as-is, or padded along the non-batch dimensions. Of course this also requires rewriting default\_collate, which mmcv does in
parallel.collate.
collate\_fn是DataLoader中将序列dataset组织成batch大小的函数,这里帖三个普通例子:
collate\_fn is the function in DataLoader that assembles the sequential dataset into batches. Here are three ordinary examples:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
def collate_fn_1(batch):
# 这是默认的,明显batch中包含相同形状的img\_tensor和label
return tuple(zip(*batch))
def coco_collate_2(batch):
# 传入的batch数据是被albu增强后的(字典结构)
imgs = [s['image'] for s in batch] # tensor, h, w, c->c, h, w , handle at transform in __getitem__
annots = [s['bboxes'] for s in batch]
labels = [s['category_id'] for s in batch]
# 以当前batch中图片annot数量的最大值作为标记数据的第二维度值,空出的就补-1。
max_num_annots = max(len(annot) for annot in annots)
annot_padded = np.ones((len(annots), max_num_annots, 5))*-1
if max_num_annots > 0:
for idx, (annot, lab) in enumerate(zip(annots, labels)):
if len(annot) > 0:
annot_padded[idx, :len(annot), :4] = annot
# 不同模型,损失值计算可能不同,这里ssd结构需要改为xyxy格式并且要做尺度归一化
# 这一步完全可以放到\_\_getitem\_\_中去,只是albu的格式需求问题。
annot_padded[idx, :len(annot), 2] += annot_padded[idx, :len(annot), 0] # xywh-->x1,y1,x2,y2 for general box,ssd target assigner
annot_padded[idx, :len(annot), 3] += annot_padded[idx, :len(annot), 1] # contains padded -1 label
annot_padded[idx, :len(annot), :] /= 640 # priorbox for ssd primary target assinger
annot_padded[idx, :len(annot), 4] = lab
return torch.stack(imgs, 0), torch.FloatTensor(annot_padded)
def detection_collate_3(batch):
targets = []
imgs = []
for _, sample in enumerate(batch):
for _, img_anno in enumerate(sample):
if torch.is_tensor(img_anno):
imgs.append(img_anno)
elif isinstance(img_anno, np.ndarray):
annos = torch.from_numpy(img_anno).float()
targets.append(annos)
return torch.stack(imgs, 0), targets # 做了stack, DataContainer可以不做stack
\end{lstlisting}
以上就是数据处理的相关内容。
最后再用DataLoader封装拆成迭代器,其相关细节,sampler等暂略。
The above covers the data processing. Finally, the dataset is wrapped by a DataLoader and consumed as an iterator; related
details such as the sampler are omitted for now.
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
data_loader = DataLoader(
dataset,
batch_size=batch_size,
sampler=sampler,
num_workers=num_workers,
collate_fn=partial(collate, samples_per_gpu=imgs_per_gpu),
pin_memory=False,
worker_init_fn=init_fn,
**kwargs)
\end{lstlisting}
\section{训练流程}
训练流程的包装过程大致如下:tools/train.py->apis/train.py->mmcv/runner.py->mmcv/hook.py(后面是分散的),其中runner维护了数据信息,优化器,
日志系统,训练loop中的各节点信息,模型保存,学习率等。另外补充一点,以上包装过程
在mmdet中无处不在,包括mmcv的代码也是对日常频繁使用的函数进行了统一封装。
The wrapping of the training process is roughly: tools/train.py -> apis/train.py -> mmcv/runner.py -> mmcv/hook.py (beyond that
it is scattered), where the runner maintains the data, the optimizer, logging, the information of each node in the training
loop, model saving, the learning rate, and so on. This kind of layered wrapping is everywhere in mmdet, and mmcv itself likewise
provides unified wrappers around frequently used functions.
\label{trainpipeline}
\subsection{训练逻辑}
图见\ref{trainpipe_pic},注意它的四个层级。代码上,主要查看apis/train.py和mmcv中runner相关的文件,核心围绕Runner,Hook两个类。
Runner将模型,批处理函数batch\_processor,优化器作为基本属性,训练过程中与训练状态,各节点相关的信息
被记录在mode,\_hooks,\_epoch,\_iter,\_inner\_iter,\_max\_epochs,\_max\_iters中,这些信息维护了训练过程中插入不同hook的操作方式。
理清训练流程只需看Runner的成员函数run:在run里会根据mode按配置中workflow的epoch循环调用train和val函数,跑完所有的epoch。比如train:
See Figure \ref{trainpipe_pic} and note its four levels. In the code, mainly read apis/train.py and the runner-related files in
mmcv; the core revolves around the Runner and Hook classes. Runner keeps the model, the batch function batch\_processor and the
optimizer as basic attributes. During training, the information related to the training state and to each node is recorded in
mode, \_hooks, \_epoch, \_iter, \_inner\_iter, \_max\_epochs and \_max\_iters; this information maintains the mechanism by which
different hooks are inserted during training. To understand the training flow it suffices to read Runner's member function run:
inside run, the train and val functions are called per epoch according to the mode and the workflow in the configuration, until
all epochs have run. For example, train:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
def train(self, data_loader, **kwargs):
self.model.train()
self.mode = 'train' # 改变模式
self.data_loader = data_loader
self._max_iters = self._max_epochs * len(data_loader) # 最大batch循环次数
self.call_hook('before_train_epoch') # 根据名字获取hook对象函数
for i, data_batch in enumerate(data_loader):
self._inner_iter = i # 记录训练迭代轮数
self.call_hook('before_train_iter') # 一个batch前向开始
outputs = self.batch_processor(
self.model, data_batch, train_mode=True, **kwargs)
self.outputs = outputs
self.call_hook('after_train_iter') # 一个batch前向结束
self._iter += 1 # 方便resume时,知道从哪一轮开始优化
self.call_hook('after_train_epoch') # 一个epoch结束
self._epoch += 1 # 记录训练epoch状态,方便resume
\end{lstlisting}
上面需要说明的是自定义hook类。自定义hook类需继承mmcv的Hook类,其默认了6+8+4个成员函数,也即\ref{trainpipe_pic}所示的6个层级节点,
外加2*4个区分train和val的节点记录函数,以及4个边界检查函数。从train.py中容易看出,在训练之前,已经将需要的hook函数注册到Runner的
self.\_hooks中了,包括从配置文件解析的优化器,学习率调整函数,模型保存,一个batch的时间记录等(注册的hook算子在self.\_hooks中按优先级
升序排列)。这里的call\_hook函数定义如下:
What needs explaining above is the custom hook class. A custom hook must inherit from mmcv's Hook class, which by default
defines 6+8+4 member functions: the 6 level nodes shown in \ref{trainpipe_pic}, plus 2*4 node-recording functions that
distinguish train and val, and 4 boundary-check functions. From train.py it is easy to see that, before training, the required
hooks have already been registered into the Runner's self.\_hooks, including the optimizer parsed from the configuration file,
the learning-rate scheduler, checkpoint saving, per-batch timing, and so on (the registered hooks are kept in self.\_hooks in
ascending order of priority). The call\_hook function is defined as follows:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
def call_hook(self, fn_name):
for hook in self._hooks:
getattr(hook, fn_name)(self)
\end{lstlisting}
容易看出,在训练的不同节点,将从注册列表中调用实现了该节点函数的类成员函数.比如
It is easy to see that, at each node of training, the member functions of the registered hooks that implement that node will be
called. For example:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
class OptimizerHook(Hook):
def __init__(self, grad_clip=None):
self.grad_clip = grad_clip
def clip_grads(self, params):
clip_grad.clip_grad_norm_(
filter(lambda p: p.requires_grad, params), **self.grad_clip)
def after_train_iter(self, runner):
runner.optimizer.zero_grad()
runner.outputs['loss'].backward()
if self.grad_clip is not None:
self.clip_grads(runner.model.parameters())
runner.optimizer.step()
\end{lstlisting}
将在每个train\_iter后实现反向传播和参数更新.
Backpropagation and the parameter update are thus performed after each train\_iter.
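As a further illustration, a user-defined hook only needs to subclass Hook and override the node functions it cares about (a sketch; the class name LossPrinterHook is made up, and it relies on the runner attributes shown in the train loop above):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
class LossPrinterHook(Hook):
    # Illustrative custom hook: print the loss every `interval` iterations.
    def __init__(self, interval=50):
        self.interval = interval

    def after_train_iter(self, runner):
        if (runner.inner_iter + 1) % self.interval == 0:
            loss = runner.outputs['loss'].item()
            print('epoch {}, iter {}: loss = {:.4f}'.format(
                runner.epoch + 1, runner.inner_iter + 1, loss))
\end{lstlisting}
It can then be attached with runner.register\_hook(LossPrinterHook()), after which call\_hook('after\_train\_iter') reaches it like any built-in hook.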
学习率优化相对复杂一点,其基类LrUpdaterHook,实现了before\_run, before\_train\_epoch, before\_train\_iter三个hook函数,意义自明.
这里选一个余弦式变化,稍作说明:
The learning-rate schedule is a bit more involved. Its base class LrUpdaterHook implements the three hook functions
before\_run, before\_train\_epoch and before\_train\_iter, whose meanings are self-evident. Here we pick the cosine schedule as
an example:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
class CosineLrUpdaterHook(LrUpdaterHook):
def __init__(self, target_lr=0, **kwargs):
self.target_lr = target_lr
super(CosineLrUpdaterHook, self).__init__(**kwargs)
def get_lr(self, runner, base_lr):
if self.by_epoch:
progress = runner.epoch
max_progress = runner.max_epochs
else:
progress = runner.iter # runner需要管理各节点信息的原因之一
max_progress = runner.max_iters
return self.target_lr + 0.5 * (base_lr - self.target_lr) * \
(1 + cos(pi * (progress / max_progress)))
\end{lstlisting}
从get\_lr可以看到,学习率变换周期有两种,epoch->max\_epoch,或者更细粒度的iter->max\_iter,
后者表明一个epoch内不同batch的学习率可以不同;这两种方式都没有什么理论依据,都可以用。
其中base\_lr为初始学习率,target\_lr为学习率衰减到的下界,当前学习率即为返回值。
As get\_lr shows, there are two kinds of schedule periods: epoch -> max\_epoch, or the finer-grained iter -> max\_iter; the
latter means the learning rate can differ between batches within one epoch. Neither choice has much theory behind it, so both
work. Here base\_lr is the initial learning rate, target\_lr is the lower bound that the learning rate decays towards, and the
current learning rate is the return value.
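A quick sanity check of the cosine formula in plain Python (independent of the Runner; the numbers are only illustrative):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
from math import cos, pi

def cosine_lr(progress, max_progress, base_lr=0.02, target_lr=0.0):
    # Same formula as get_lr above.
    return target_lr + 0.5 * (base_lr - target_lr) * (
        1 + cos(pi * progress / max_progress))

print(cosine_lr(0, 12))   # 0.02 -> starts at base_lr
print(cosine_lr(6, 12))   # 0.01 -> halfway between base_lr and target_lr
print(cosine_lr(12, 12))  # 0.0  -> ends at target_lr
\end{lstlisting}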
\section{Core}
\subsection{anchor}
\label{core:anchor}
anchor首先来源于rcnn,在我看来,anchor利用强监督的特点,在原图上构造了完整的bbox空间(非数学意义的空间),然后根据人为标定bbox,
选出有效的拟合集合(子空间,非严格表达),从而使得优化变得更有效(原本是实现了end2end)。
The anchor originates from R-CNN. In my view, anchors exploit the strongly supervised setting to construct a complete bbox
space (not a space in the mathematical sense) over the original image, and then, according to the human-annotated bboxes, select
an effective set to fit (a subspace, loosely speaking), which makes the optimization more effective (the original goal was an
end-to-end formulation).
AnchorGenerator类为不同特征层生成anchor,其中特征层上单个像素的anchor是以此像素为中心,
按给定的尺度和扩展比率生成,也即一个像素对应的anchor数为len(scales)*len(ratios)。
The AnchorGenerator class generates anchors for the different feature levels. The anchors of a single pixel on a feature map
are centered at that pixel and generated with the given scales and aspect ratios, i.e. the number of anchors per pixel is
len(scales)*len(ratios).
输入参数base\_size, scales, ratios一般作用于非retina类网络,含义分别表示:anchor在特征层上的基础大小(特征层相对于原图的stride),anchor在特征层上的尺度大小(可以多个,增加感受野),anchor在保持基础大小不变的情况下的长宽比。
比如输入图像大小(640*640),选择(p2, p3)作为其特征层,则p2大小为(160*160),base\_size=4,若设定ratios=[0.5,1.0,2.0], scales=[8, 16],
则在p2上一格对应的base\_anchor的(w,h)为[(45.25,22.63), (90.51, 45.25), (32.00, 32.00), (64.00, 64.00), (22.63, 45.25), (45.25, 90.51)]。其中$64=4*16*1, 90.51=4*16*\sqrt{2}, 22.63=4*8/\sqrt{2}$。
那么每一格所对应的6个base\_anchor相对于中心点的偏移量即为(v2.0不取整):
The input parameters base\_size, scales and ratios are generally used for non-RetinaNet-style networks. They mean,
respectively: the base size of the anchor on the feature level (the stride of the feature level relative to the original image),
the scale of the anchor on the feature level (there can be several, to enlarge the receptive field), and the aspect ratio of the
anchor while keeping the base size unchanged. For example, with an input image of size 640*640 and (p2, p3) chosen as the
feature levels, p2 has size 160*160 and base\_size=4; with ratios=[0.5,1.0,2.0] and scales=[8, 16], the (w, h) of the
base\_anchors at one cell of p2 are [(45.25, 22.63), (90.51, 45.25), (32.00, 32.00), (64.00, 64.00), (22.63, 45.25),
(45.25, 90.51)], where $64=4*16*1$, $90.51=4*16*\sqrt{2}$, $22.63=4*8/\sqrt{2}$. The offsets of the 6 base\_anchors of each cell
relative to the center point are then (v2.0 does not round):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
[[-21., -9., 24., 12.],
[-43., -21., 46., 24.],
[-14., -14., 17., 17.],
[-30., -30., 33., 33.],
[ -9., -21., 12., 24.],
[-21., -43., 24., 46.]]
\end{lstlisting}
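The (w, h) values listed above can be reproduced in a few lines (a standalone check, not mmdet code):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import numpy as np

base_size = 4
scales = np.array([8., 16.])
ratios = np.array([0.5, 1., 2.])   # ratio = h / w
h_ratios = np.sqrt(ratios)
w_ratios = 1. / h_ratios
# outer product: one (w, h) per (ratio, scale) pair
ws = (base_size * w_ratios[:, None] * scales[None, :]).reshape(-1)
hs = (base_size * h_ratios[:, None] * scales[None, :]).reshape(-1)
print(np.round(np.stack([ws, hs], axis=1), 2))
# -> pairs: (45.25,22.63) (90.51,45.25) (32,32) (64,64) (22.63,45.25) (45.25,90.51)
\end{lstlisting}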
而retina等网络会使用octave\_base\_scale和scales\_per\_octave,其含义与上面保持对应。
Networks such as RetinaNet use octave\_base\_scale and scales\_per\_octave instead; their meanings correspond to the parameters
above.
因为此处得到的anchor是以特征图为坐标系的,所以要得到原图上的anchor,还得把中心点变为对应到原图上的点。
那么回到源代码,容易看出,gen\_single\_level\_base\_anchors得到单个特征层的所有base anchor,gen\_base\_anchors将
不同特征层的base anchor汇集到列表里,single\_level\_grid\_anchors将单个特征层的anchor映射到原图坐标,grid\_anchors则
汇集所有相对原图的anchor到列表中,列表每个元素为一个tensor,记录了某个特征层上的所有anchor。
代码涉及到两个技巧:
Because the anchors obtained so far live in the coordinate system of the feature map, to obtain anchors on the original image
the center points must be shifted to the corresponding points of the original image. Back in the source code it is easy to see
that gen\_single\_level\_base\_anchors produces all base anchors of a single feature level, gen\_base\_anchors collects the base
anchors of the different feature levels into a list, single\_level\_grid\_anchors maps the anchors of a single feature level to
original-image coordinates, and grid\_anchors gathers all anchors relative to the original image into a list, where each element
is a tensor recording all anchors of one feature level. The code involves two tricks:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
# 1.gen_single_level_base_anchors
ws = (w * w_ratios[:, None] * self.scales[None, :]).view(-1) # shape: (len(ratios), len(scales))
hs = (h * h_ratios[:, None] * self.scales[None, :]).view(-1)
# 2.single_level_grid_anchors
shifts = torch.stack([shift_xx, shift_yy, shift_xx, shift_yy], dim=-1)
shifts = shifts.type_as(base_anchors)
# first feat_w elements correspond to the first row of shifts
# add A anchors (1, A, 4) to K shifts (K, 1, 4) to get
# shifted anchors (K, A, 4), reshape to (K*A, 4)
# base_anchors:(fw*hw*num_anchors, 4), shifts:(fw*hw, 4)
all_anchors = base_anchors[None, :, :] + shifts[:, None, :] #Automatic tensor expansion
all_anchors = all_anchors.view(-1, 4)
\end{lstlisting}
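The broadcasting in the second trick can be checked with toy shapes (illustrative only):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import torch

A, K = 3, 4                    # 3 base anchors, a 2x2 feature map -> 4 shifts
base_anchors = torch.randn(A, 4)
shifts = torch.randn(K, 4)
all_anchors = base_anchors[None, :, :] + shifts[:, None, :]   # (K, A, 4)
print(all_anchors.shape)       # torch.Size([4, 3, 4])
all_anchors = all_anchors.view(-1, 4)   # (K*A, 4): anchors grouped per cell
\end{lstlisting}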
另一个PointGenerator代码如下:
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
class PointGenerator(object):
def _meshgrid(self, x, y, row_major=True):
xx = x.repeat(len(y))
yy = y.view(-1, 1).repeat(1, len(x)).view(-1)
if row_major:
return xx, yy # (xx[i],yy[j])为一个矩阵A_xy的坐标(i,j), dim(xx)=1时
else:
return yy, xx
def grid_points(self, featmap_size, stride=16, device='cuda'):
feat_h, feat_w = featmap_size
shift_x = torch.arange(0., feat_w, device=device) * stride # 特征网格对应到原图网格
shift_y = torch.arange(0., feat_h, device=device) * stride
shift_xx, shift_yy = self._meshgrid(shift_x, shift_y)
stride = shift_x.new_full((shift_xx.shape[0], ), stride)
shifts = torch.stack([shift_xx, shift_yy, stride], dim=-1) # x, y和对应点的stride大小
all_points = shifts.to(device)
return all_points
\end{lstlisting}
代码上,主要的难点在于高维张量的处理。
In code implementation, the main difficulty lies in the processing of high-dimensional tensors.
\subsection{bbox}
\subsubsection{coder}
bbox的编码是基础且重要的,一家之言:anchor generator让优化空间合理,bbox编码让优化更合理(guided anchor处在这两者之间)。
通常bbox连归一化都不做的,效果不会好到哪里去(可以试试retinaface)。但是在不同方法体系里面,就不好说了,比如以中心点
表示方式的CenterNet,对$w, h$做回归,就没有编码,直接拟合$\Delta w, \Delta h$,但它是在下采样$1/4$后的特征图上,
也即尺度上还是除了4。总归而言,要让优化对象存在于一个合理的空间上,才是本质的。
The bbox encoding is basic and important. One personal take: the anchor generator makes the optimization space reasonable, and
the bbox encoding makes the optimization itself more reasonable (guided anchoring sits between the two). A bbox branch that does
not even normalize usually does not work well (try retinaface). But in other frameworks this is less clear-cut; for example
CenterNet, with its center-point representation, regresses $w, h$ without any encoding and fits $\Delta w, \Delta h$ directly,
although it works on the feature map downsampled by $1/4$, i.e. the scale is still divided by 4. In short, what is essential is
that the optimization target lives in a reasonable space.
one, two stage的bbox编码方式,来源于14年Ross Girshick等人写的rcnn,
其编码为
The bbox encoding used by one- and two-stage detectors comes from R-CNN, written by Ross Girshick et al. in 2014; the encoding
is
$$
\begin{aligned}
t_{x} &=\left(G_{x}-P_{x}\right) / P_{w} \label{1} \\
t_{y} &=\left(G_{y}-P_{y}\right) / P_{h} \label{2}\\
t_{w} &=\log \left(G_{w} / P_{w}\right) \label{3}\\
t_{h} &=\log \left(G_{h} / P_{h}\right) \label{4}
\end{aligned}
$$
此处$P$是anchor,$G$是标框。
Here $P$ is the anchor and $G$ is the ground-truth box.
解码为:
decoded as:
$$
\begin{aligned}
\hat{G}_{x} &=P_{w} d_{x}(P)+P_{x} \\
\hat{G}_{y} &=P_{h} d_{y}(P)+P_{y} \\
\hat{G}_{w} &=P_{w} \exp \left(d_{w}(P)\right) \\
\hat{G}_{h} &=P_{h} \exp \left(d_{h}(P)\right)
\end{aligned}
$$
% 相应的优化函数我也帖一下:
% $$
% \mathbf{w}_{\star}=\underset{\hat{\mathbf{w}}_{\star}}{\operatorname{argmin}}
% \sum_{i}^{N}\left(t_{\star}^{i}-\hat{\mathbf{w}}_{\star}^{\mathrm{T}}
% \phi_{5}\left(P^{i}\right)\right)^{2}+\lambda\left\|\hat{\mathbf{w}}_{\star}\right\|^{2}
% $$
% 关于这种编码的合理性,见\ref{sub:anchorhead}的分析。
那么代码如何实现呢?
How is the code implemented?
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
class BaseBBoxCoder(metaclass=ABCMeta):
def __init__(self, **kwargs):
pass
@abstractmethod
def encode(self, bboxes, gt_bboxes):
pass
@abstractmethod
def decode(self, bboxes, bboxes_pred):
pass
def bbox2delta(proposals, gt, means=(0., 0., 0., 0.), stds=(1., 1., 1., 1.)):
assert proposals.size() == gt.size()
proposals = proposals.float() # 浮点数
gt = gt.float()
px = (proposals[..., 0] + proposals[..., 2]) * 0.5 # 中心点(x1+x2)/2
py = (proposals[..., 1] + proposals[..., 3]) * 0.5
pw = proposals[..., 2] - proposals[..., 0]
ph = proposals[..., 3] - proposals[..., 1]
gx = (gt[..., 0] + gt[..., 2]) * 0.5
gy = (gt[..., 1] + gt[..., 3]) * 0.5
gw = gt[..., 2] - gt[..., 0] # 宽(x2 - x1) = w
gh = gt[..., 3] - gt[..., 1]
dx = (gx - px) / pw # \eqref{1}
dy = (gy - py) / ph # \eqref{2}
dw = torch.log(gw / pw)
dh = torch.log(gh / ph) # \eqref{4}
deltas = torch.stack([dx, dy, dw, dh], dim=-1) # 最后一维度stack
means = deltas.new_tensor(means).unsqueeze(0) # new_tensor和unsqueeze(扩张维度)
stds = deltas.new_tensor(stds).unsqueeze(0)
deltas = deltas.sub_(means).div_(stds) # sub_ 和 sub 的区别
return deltas
gx = torch.addcmul(px, 1, pw, dx) # gx = px + pw * dx
gy = torch.addcmul(py, 1, ph, dy) # gy = py + ph * dy
# torch.addcmul(input, tensor1, tensor2, *, value=1, out=None) → Tensor
# outi=input_i+value×tensor_1i×tensor_2i
\end{lstlisting}
stack, new\_tensor, unsqueeze, div\_可以学习一下。另外为何$\Delta$坐标系(说法不严谨,
但可以理解,cv论文经常这样)要做高斯归一化变换呢?可参考\href{https://arxiv.org/abs/1904.04620}{Gaussian YOLOv3}。
我的理解,将优化空间映射到高斯$\mu-\sigma$分布中。
Basic tensor operations such as stack, new\_tensor, unsqueeze and div\_ are worth learning here. Also, why apply a Gaussian
normalization to the $\Delta$ coordinates (the wording is loose, but common in CV papers)? See
\href{https://arxiv.org/abs/1904.04620}{Gaussian YOLOv3}; my understanding is that it maps the optimization space onto a
Gaussian $\mu$-$\sigma$ distribution.
在解码函数delta2bbox中有如下函数需要注意一下(tensor.*):
repeat, clamp, expand\_as, exp, torch.addcmul, view\_as。
In the decoding function delta2bbox the following functions (tensor.*) are worth mastering:
repeat, clamp, expand\_as, exp, torch.addcmul, view\_as; these basic functions are necessary tools for implementing detection
algorithms.
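A simplified sketch of that inverse transform (the function name delta2bbox\_simple is made up; unlike the real delta2bbox it neither clamps extreme dw/dh nor clips the decoded boxes to the image):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import torch

def delta2bbox_simple(rois, deltas, means=(0., 0., 0., 0.),
                      stds=(1., 1., 1., 1.)):
    # Inverse of bbox2delta above: denormalize the deltas, then apply
    # Gx = Px + Pw*dx, Gy = Py + Ph*dy, Gw = Pw*exp(dw), Gh = Ph*exp(dh).
    means = deltas.new_tensor(means)
    stds = deltas.new_tensor(stds)
    dx, dy, dw, dh = (deltas * stds + means).unbind(dim=-1)
    px = (rois[..., 0] + rois[..., 2]) * 0.5
    py = (rois[..., 1] + rois[..., 3]) * 0.5
    pw = rois[..., 2] - rois[..., 0]
    ph = rois[..., 3] - rois[..., 1]
    gx = px + pw * dx
    gy = py + ph * dy
    gw = pw * dw.exp()
    gh = ph * dh.exp()
    return torch.stack([gx - gw * 0.5, gy - gh * 0.5,
                        gx + gw * 0.5, gy + gh * 0.5], dim=-1)
\end{lstlisting}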
另外一种编码方式,来自19年Chenchen Zhu等人写的\href{https://arxiv.org/abs/1903.00621}{FSAF}。
将(x1, y1,x2, y2)编码为(top, bottom, left, right),含义自明。代码上需要注意的是$w, h$的归一化以及整体
的归一化因子,默认除以4。此编码对应的bbox loss 也将变为Iou 系列 Loss, 若说明,也应该是在模型解析部分补充。
Another encoding, from \href{https://arxiv.org/abs/1903.00621}{FSAF} by Chenchen Zhu et al. (2019), encodes $(x1, y1, x2, y2)$
as (top, bottom, left, right), whose meaning is self-explanatory. In the code, note the normalization of $w, h$ and the overall
normalization factor, which defaults to 4. The bbox loss corresponding to this encoding then becomes an IoU-style loss; if it is
discussed, it should be in the model-analysis part.
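A rough sketch of this point-based encoding (the function name is made up; the real TBLR coder additionally applies the w/h and overall normalization mentioned above in its own way):
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
import torch

def bboxes2tblr_simple(points, gt_bboxes, normalizer=4.0):
    # points: (N, 2) centers (x, y); gt_bboxes: (N, 4) as (x1, y1, x2, y2).
    # Encode each box as distances (top, bottom, left, right) from its point.
    x, y = points[:, 0], points[:, 1]
    top = y - gt_bboxes[:, 1]
    bottom = gt_bboxes[:, 3] - y
    left = x - gt_bboxes[:, 0]
    right = gt_bboxes[:, 2] - x
    return torch.stack([top, bottom, left, right], dim=-1) / normalizer
\end{lstlisting}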
\subsubsection{assigners}
assign主要给特征图上的anchor赋予有意义的监督信息,从而使优化更为有效。那么自然他的参数至少包含
anchors,gt\_bboxes,gt\_labels。实际因为数据的复杂,或许有gt\_bboxes\_ignore等信息。
assign的机制可总结为如下四点:
Assign mainly assigns meaningful supervision information to the anchors on the feature map,
making optimization more effective. Then naturally its parameters include at least anchors,
gt\_bboxes, gt\_labels. Actually, because of the complexity of the data, there may be information
such as gt\_bboxes\_ignore.
The assign mechanism can be summarized as follows:
\begin{itemize}
\item[1.] 将所有框置为-1(未分配/忽略) Mark every box as unassigned (index set to -1)
\item[2.] 将与所有gts的iou小于neg\_iou\_thr的置为0 Set boxes whose IoU with every gt is below neg\_iou\_thr to 0 (negative)
\item[3.] 将与所有gt的max(iou)大于pos\_iou\_thr的框置为对应gt Assign a box to the corresponding gt if its max IoU over all gts exceeds pos\_iou\_thr
\item[4.] 对每个gt,将与其iou最大的框置为gt For each gt, assign the box with the highest IoU with it to that gt
\end{itemize}
关于上述四点:用生成的bbox去拟合实际的bbox,那么实际的bbox至少有一个可以优化的对象。
有了这些认识,代码就相对容易了。
\lstset{style=mystyle}
\begin{lstlisting}[language=Python]
@BBOX_ASSIGNERS.register_module()
class MaxIoUAssigner(BaseAssigner):
# 关键代码(BaseAssigner,抽象基类)
# assign:
... # gt太多,就cpu上计算
overlaps = self.iou_calculator(gt_bboxes, bboxes) # gt在前,更快吧
...
# assign_wrt_overlaps
# 1. 默认全部为背景-1
assigned_gt_inds = overlaps.new_full((num_bboxes, ),
-1,
dtype=torch.long)
# 每个bbox与所有gt的最大iou值和index
max_overlaps, argmax_overlaps = overlaps.max(dim=0)