PyTorch

Continuously updated

Posted on Feb.19, 2020

Common PyTorch code snippets

📓Cookbook

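Notes on using the framework

Managing deep learning experiments

Taking an open-source codebase as an example: when you maintain a deep learning project over the long term, how the code is organized starts to matter. The goal is a structure that is simple yet extensible, which is where object-oriented design from software engineering comes in. It lets us manage deep learning experiments efficiently and in a standardized way.

Jump to:
Getting familiar with the tools
Parameter management
Log management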

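Pitfalls and bugs encountered

In-place operations

          Convolution and similar operations are not in-place.
          ReLU can optionally be in-place (inplace=True).
          Operators such as -= and += are in-place.
          sub_mean(), the modified MeanShift implemented in CAIN, is also in-place.
          Because the in-place operation runs as soon as the input enters the network, prediction = model(input) modifies input itself.
          The fix is to clone the tensor on entry:
          def forward(self, x):
                x = x.clone()
          In RCAN and EDSR, the MeanShift used on the DIV2K dataset is implemented as a convolution, so the problem does not arise there.

          Summary: for a variable that must not be affected by in-place operations, make a copy first, for example x.copy() for a NumPy array or x.clone() for a tensor.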
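The effect described above can be reproduced on a tiny tensor. This is a minimal sketch: `sub_mean_inplace` and `sub_mean_safe` are hypothetical stand-ins for an in-place mean shift, not CAIN's actual `sub_mean()`.

```python
import torch

def sub_mean_inplace(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical in-place mean shift: "-=" writes into the caller's tensor.
    x -= x.mean()
    return x

def sub_mean_safe(x: torch.Tensor) -> torch.Tensor:
    # Clone on entry so the caller's tensor is left untouched.
    x = x.clone()
    x -= x.mean()
    return x

a = torch.tensor([1.0, 2.0, 3.0])
sub_mean_inplace(a)
print(a)  # the caller's tensor has been modified

b = torch.tensor([1.0, 2.0, 3.0])
out = sub_mean_safe(b)
print(b)  # unchanged
```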

Detecting in-place errors

          # detect in-place errors during backward
          torch.autograd.set_detect_anomaly(True)

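Initialization

The default initialization that ships with PyTorch modules is usually the more reliable choice.

Dataset input problems

To trace a problem back to a specific sample, turn off shuffle on the training DataLoader and locate the corresponding element in the dataset. For example, the Youku super-resolution dataset contains videos at 2048x1152 resolution.
           Remember that the dataset is normally shuffled, and keep that in mind whenever you write data-handling logic.

Validation looks fine, but test outputs show strange color shifts and distortion

Most likely model.load_state_dict() was forgotten, so the images were produced by a model with its default initialization. PSNR then drops straight to somewhere between 10 and 20 dB.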

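Measuring model execution time in PyTorch

          Timing is commonly done like this (method 1):

          # Method 1:
          start = time.time()
          result = model(input)
          end = time.time()

          The correct way, however, is method 2:

          # Method 2:
          torch.cuda.synchronize()
          start = time.time()
          result = model(input)
          torch.cuda.synchronize()
          end = time.time()

          Why?
          In PyTorch, CUDA kernels are launched asynchronously: model(input) returns as soon as the work has been queued on the GPU. With method 1 the measured time is far too short, because end = time.time() runs while the GPU is still computing.
              With method 2, the code synchronizes with CUDA and waits until all queued GPU work has finished before evaluating end = time.time().

          Now change method 1 into method 3:

          # Method 3:
          start = time.time()
          result = model(input)
          print(result)  # or result.cpu()
          end = time.time()

          Method 3 gives roughly the same time as method 2, because printing needs the values on the host and therefore waits for the GPU to finish, which acts as an implicit synchronization.
              Replacing print(result) with result.cpu() gives the same result.


          Source: a Jianshu note by the author 几时见得清梦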

torch.cuda.synchronize()


            # Timing with CUDA events
            import torch
            from contextlib import contextmanager

            @contextmanager
            def timer(name):
                start = torch.cuda.Event(enable_timing=True)
                end = torch.cuda.Event(enable_timing=True)

                start.record()
                yield
                end.record()

                # wait for the GPU so that elapsed_time() is valid
                torch.cuda.synchronize()
                print(f'[{name}] done in {start.elapsed_time(end):.3f} ms')

            with timer('GTX 1080 Ti 720p inference'):
                    out = model(input)

            Only this way are the measurements consistent.

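Optimizer and learning-rate schedule for comparison experiments

① No scheduler: Adam, trained for a large number of epochs. This is the fairer comparison.
② ReduceLROnPlateau involves some luck. Either drive it by whether the loss is decreasing or, if you monitor validation PSNR, set a very large patience.
③ Cosine annealing. Any schedule is fine as long as it is the same across the compared runs.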

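Difference between train and val

          model.train(): BatchNorm normalizes with the current batch statistics and updates its running estimates; Dropout is active.
          model.eval(): BatchNorm normalizes with the stored running estimates; Dropout is disabled.
          During validation, additionally wrap the forward pass in with torch.no_grad().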
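The two modes can be observed directly on standalone layers. A minimal sketch on CPU with a constant input, chosen only to make the difference easy to see:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(4, 8)

drop = nn.Dropout(p=0.5)
drop.train()
drop_train_out = drop(x)   # entries are either zeroed or scaled by 1 / (1 - p)
drop.eval()
drop_eval_out = drop(x)    # identity in eval mode

bn = nn.BatchNorm1d(8)
bn.eval()
with torch.no_grad():
    bn_eval_out = bn(x)    # uses the running estimates (initially mean 0, var 1)
bn.train()
bn_train_out = bn(x)       # uses the batch statistics: a constant batch maps to 0

print(drop_train_out[0], drop_eval_out[0])
print(bn_eval_out[0], bn_train_out[0])
```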

Caveats of model.eval()

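First, look at the official docs for the interface.
A well-known forum thread is "Model.eval() gives incorrect loss for model with batchnorm layers".
One suggestion there is to set track_running_stats=False for all batch norm layers in the model:
for child in model.children():
    for ii in range(len(child)):
        if type(child[ii])==nn.BatchNorm2d:
            child[ii].track_running_stats = False
Others point out that this effectively goes back to not calling model.eval(): the layer "uses the current batch's mean and variance to do the normalization".
I see this as giving the history zero weight, so that it has no influence. (That is momentum=0 in the convention where momentum weights the history; PyTorch's momentum argument weights the new batch instead, so there momentum=0 would freeze the running estimates.)
The underlying cause is data that does not suit batchnorm: too noisy (true for my task), a batch size that is too small (true under DDP), or distributions that differ too much.
Hence the advice to weight history less, in the limit not at all, which is what track_running_stats = False does.
However, someone from the PyTorch team recommends weighting history more, which is more robust to outliers (non-stationary data) and preserves what was learned earlier.

My own view is that this needs to be understood deeply, applied flexibly, and thought over again.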

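Let's start from the principle. BatchNorm pulls activations back towards a standard normal distribution, which helps training in many ways.
How does it get there? After normalizing with the mean and variance, it applies a scale (weight) and a shift (bias).
Normalization layers in any framework work this way, and the scale and shift are learnable, trainable parameters.
With these two parameters the features can be moved towards "normal". But we cannot obtain, and do not even want, a strictly normal distribution:
① weight and bias are learned from aggregate statistics, so for a particular feature they will not map it exactly to normal;
② if everything really were exactly normal, there would be nothing left to discriminate on and no usable information.
So the real goal is only to keep the mean and variance under control, not to force them to specific values.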

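With a small momentum (history weighted less), the running mean and variance follow the current input more and accumulate less history.
I think of this as guarding against "overfitting" to history, in cases where history does little to bring the features closer to normal. It suits situations where the distribution changes often.
track_running_stats=False serves the same purpose: only the current batch statistics are used, and there is no history buffer.

With a large momentum (history weighted more), the layer relies more on what is in the history buffer, which amounts to saying that history helps more than the current statistics.

In summary, for tasks such as Gaussian denoising or high-level vision, the feature distributions are distinctive and change little, so batchnorm fits, with more weight on history.
For alignment or optical flow, weighting history less is the safer choice. For super-resolution, it is better not to use batchnorm at all.
Using it or not is negotiable; what you least want is for the other modules to "overfit" to batchnorm producing normal features during training, only for batchnorm to fail to deliver that at inference.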

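As for how model.eval() affects this, the docs say:
During training, this layer keeps a running estimate of its computed mean and variance.
The running sum is kept with a default momentum of 0.1.

During evaluation, this running mean/variance is used for normalization.
In other words, only in eval() is the accumulated history actually used; it is stored during training according to momentum.
Why enable it? Because the history is assumed to help.
(Separately, the running variance is accumulated with the unbiased estimator, which is what "Bessel's correction" refers to.)
A forum question puts it this way:
I mean, why is it better to use model.eval and
take the running statistics and not rely on the current test image statistics?
Then why does eval(), which is supposed to help, sometimes hurt performance? Because the history is actually having a negative effect.
But sometimes eval() does help and improves results. You can check by running inference from the same checkpoint with track_running_stats toggled,
and keep the better one.
One reply:
if you compute test stat, then you are basically “train” on test set,
because stat in this case is a trained param.
You can do it, nobody says you can’t, just that ppl would consider it “cheating”
There is one more important point: always test with eval() on. Otherwise it is a form of data "leakage", because the test data leaves its trace in the batchnorm statistics.
Such a test is not fair.
That is the reason for using eval(): it prevents "overfitting to the test set". When it lowers performance, the cause is the unhelpful history described above.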

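Conclusion: always use eval() for inference. How to set momentum and track_running_stats depends on the case; it is not a matter of principle,
so use whatever works best. In my experience, history is generally useful, and keeping track_running_stats on usually helps.
When does history become less useful, or even harmful? When the distribution shifts a lot, or the batch size is small (which also makes the statistics fluctuate);
then the gap between history and the current batch is too large.
As for models whose performance drops after eval(): first, they were leaking data before, which should not be encouraged; second, try track_running_stats=False, since history is hurting there.
In the end, compare track_running_stats=True against False, or try discarding history through momentum, and keep the better one.

I think the best approach is: once training is finished, run forward-only passes (no backward) with large batches, then save the model. Let the history do its job!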
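The forward-only refresh suggested above can be sketched as follows. `recalibrate_bn` is a hypothetical helper name; resetting the statistics and setting `momentum = None` (cumulative average) are choices made for this sketch, not something the text prescribes.

```python
import torch
import torch.nn as nn

def recalibrate_bn(model: nn.Module, batches) -> None:
    """Re-estimate BatchNorm running statistics with forward-only passes."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None   # None = cumulative average over all batches seen
    model.train()               # running stats are only updated in train mode
    with torch.no_grad():       # forward only, no backward
        for x in batches:
            model(x)
    model.eval()

torch.manual_seed(0)
model = nn.Sequential(nn.BatchNorm2d(3))
data = torch.randn(64, 3, 8, 8) * 2 + 5
recalibrate_bn(model, data.split(32))   # two large batches
print(model[0].running_mean)
```

Note that model.train() also activates Dropout during these passes; if the model contains Dropout, switching only the BatchNorm layers to train mode is the more careful variant.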


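Some people argue, though, that skipping eval() is not actually wrong.
My view is that this may depend on the task and on the training batch size: if the task is not sensitive to batching
and the training batch is small and therefore unstable anyway, model.eval() may not be needed.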

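Batch Normalization has a momentum parameter that acts on the computation of the mean and variance.
It keeps the mean and variance of past batches as moving_mean and moving_variance,
borrowing the idea of momentum from optimization so that past batches continue to influence the current estimate.
In the convention where momentum weights the history, typical values are 0.9 or 0.99; after many batches, i.e. many factors of 0.9 multiplied together, the influence of the earliest batches fades. PyTorch's momentum argument weights the new batch instead, so its default of 0.1 corresponds to 0.9 in that convention.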
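Because the two conventions are easy to mix up, it helps to check what PyTorch's `momentum` does to `running_mean`. A minimal sketch on a single-feature layer, relying on the documented update running = (1 - momentum) * running + momentum * batch_stat:

```python
import torch
import torch.nn as nn

x = torch.tensor([[8.0], [12.0]])          # one feature, batch mean 10

bn = nn.BatchNorm1d(1, momentum=0.1)       # PyTorch default
bn.train()
bn(x)
print(bn.running_mean)                     # 0.9 * 0 + 0.1 * 10

frozen = nn.BatchNorm1d(1, momentum=0.0)   # the new batch gets zero weight
frozen.train()
frozen(x)
print(frozen.running_mean)                 # stays at its initial value
```

So in PyTorch a smaller momentum means more weight on history, and momentum=0.0 leaves the running estimates untouched while train mode still normalizes with the batch statistics.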

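Hence the advice to weight history less when training is non-stationary.
def evaluate_batch(net, batch, output, shouldeval):
        if shouldeval:
            net.eval()
            net.bn1.train()
            # bn1 normalizes with the batch statistics, and momentum 0.0 leaves its running estimates untouched.
            # This feels even more drastic to me: no eval() for bn1 and no trace left in the history. Possibly overkill; I would simply skip eval().
            net.bn1.momentum = 0.0
        else:
            net.train()
...
        #before returning
        net.bn1.momentum = 0.1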

The following is the clearest and most direct explanation of the underlying principle that I have come across.
「The high validation loss is due to the wrong estimates of the running stats.
Since you are feeding a constant tensor (batchone: mean=1, std=0) and a random tensor (batchtwo: mean~=0, std~=1), the running estimates will be shaky and wrong for both inputs.

During training the current batch stats will be used to compute the output, so that the model might converge.
However, during evaluation the batchnorm layer tries to normalize both inputs with skewed running estimates, which yields the high loss values.
Usually we assume that all inputs are from the same domain and thus have approx. the same statistics.

If you set track_running_stats=False in your BatchNorm layer, the batch statistics will also be used during evaluation, which will reduce the eval loss significantly.」


If you turn track_running_stats off (as suggested in the post) you will instead
use the mean and std of the batch in eval mode. This is flawed and incorrect usage,
since you will get an inference result which is based on the data in your batch.
The person making this point therefore argues for dropping batchnorm altogether and using groupnorm instead.

Many people who hit this problem are at a loss, and many also mention that it goes away when model.eval() is not used.
I tried:
・change the momentum term in BatchNorm constructor to higher.
・before you set model.eval() , run a few inputs through model (just forward pass, you dont need to backward). This will help stabilize the running_mean / running_std values.
・increase Batchsize
Nothing helped. This person then found that they had used the same batchnorm layer in different places:
In the end I saw I was indeed using the same BatchNorm layers in different parts of the network.
Once I changed that it worked again.

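torch.nn vs. torch.nn.functional

Resuming training from a checkpoint: restoring Adam's moment estimates

If you use Adam, save the optimizer state dict as well, so that it can be loaded when training resumes.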
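A minimal sketch of saving and restoring both state dicts. The dictionary keys "model" and "optimizer" are arbitrary names chosen here, and an in-memory buffer stands in for a checkpoint file path.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One step so that Adam has moment estimates worth saving.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

buffer = io.BytesIO()   # stands in for a path such as 'checkpoint.pth'
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, buffer)

# Resume: rebuild the objects first, then load both state dicts.
buffer.seek(0)
ckpt = torch.load(buffer)
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-3)
resumed_model.load_state_dict(ckpt["model"])
resumed_optimizer.load_state_dict(ckpt["optimizer"])
```

Without the optimizer state, Adam restarts with zeroed first and second moments, so the first updates after resuming behave differently from an uninterrupted run.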

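Metrics anomalies

When LPIPS is clearly out of line with PSNR and SSIM, check whether the frames are misaligned in the group with abnormally low PSNR. A classmate once hit an amusing bug: EDVR was fed input(0,1,2,3,4) but made to super-resolve frame 3,
so every test output was shifted by one frame.
With misaligned frames PSNR is certainly low, while a metric like LPIPS is barely affected.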

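Models saved with PyTorch 1.6 cannot be loaded in older versions such as 1.4

In 1.6:  torch.save(model_.state_dict(), 'model_best_bacc.pth.tar', _use_new_zipfile_serialization=False)
https://github.com/pytorch/pytorch/issues/48915
Note: in a model converted this way, the multi-GPU marker 'module' is removed from the dictionary keys.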

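Model and checkpoint do not match

      After ruling out GPU/CPU and single-GPU/multi-GPU issues, suspect that the code has changed.
          from collections import OrderedDict
          state_dict = torch.load(arg.ckp)
          new_state_dict = OrderedDict()
          for k, v in state_dict.items():
              name = k[7:] if k.startswith('module.') else k  # strip the 'module.' prefix
              new_state_dict[name] = v
          print(new_state_dict.keys())
      Redirect the output to a file (> module.txt), narrow the view to a single column, then walk through the network layer by layer to find where the code no longer matches.
      Alternatively:
      for k in state_dict.keys():
        print(k)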

      def print_network(net):
        num_params = 0
        for param in net.parameters():
            num_params += param.numel()
        print(net)
        print('Total number of parameters: %d' % num_params)

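Checking shape changes layer by layer

          image = torch.zeros((1, 3, 64, 64))
          out = image
          for name, op in resnet18.named_children():
            out = op(out)
            print(name, out.shape)
          This only works when the modules are defined in the same order as they are used in forward, and when forward does nothing else between them.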
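Forward hooks avoid the ordering restriction, because they fire in the order in which forward actually calls the modules. A minimal sketch: `trace_shapes` is a hypothetical helper, the small network is a stand-in, and every leaf module is assumed to return a single tensor.

```python
import torch
import torch.nn as nn

def trace_shapes(model: nn.Module, x: torch.Tensor):
    """Run one forward pass and record the output shape of every leaf module."""
    shapes, hooks = [], []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:   # leaf modules only
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: shapes.append((name, tuple(out.shape)))))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()                               # leave the model unchanged
    return shapes

net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),
)
for name, shape in trace_shapes(net, torch.zeros(1, 3, 64, 64)):
    print(name, shape)
```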

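Not enough GPU memory for a large patch and a suitable batch size

          Gradient accumulation
          for i, (images, target) in enumerate(train_loader):
              outputs = model(images)  # forward pass
              loss = criterion(outputs, target)  # compute the loss
              loss = loss / accumulation_steps   # optional: if the loss should be averaged over the accumulated samples

              loss.backward()  # backward pass; gradients accumulate
              if (i + 1) % accumulation_steps == 0:
                  optimizer.step()        # update the network parameters
                  optimizer.zero_grad()   # clear the gradients

          BN layers are affected somewhat, since their statistics are still computed on each small batch; lowering the momentum parameter helps (in PyTorch a smaller momentum averages the running statistics over more batches).
          https://www.zhihu.com/question/303070254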

          Gradient accumulation under DDP
          from contextlib import nullcontext

          if local_rank != -1:
              model = DDP(model)

          optimizer.zero_grad()
          for i, (data, label) in enumerate(dataloader):
              # skip the gradient all-reduce on every step except the one that calls optimizer.step()
              is_update_step = (i + 1) % accumulation_steps == 0
              my_context = model.no_sync if local_rank != -1 and not is_update_step else nullcontext
              with my_context():
                  prediction = model(data)
                  loss = loss_fn(prediction, label) / accumulation_steps
                  loss.backward()
              if is_update_step:
                  optimizer.step()
                  optimizer.zero_grad()
          Reference:
          https://zhuanlan.zhihu.com/p/250471767

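Notes on GAN training

        Avoid copy-pasting generator code when writing the discriminator part. Mistakes in the generator tend to surface at run time, but mistakes introduced by copy-pasting into the discriminator are well hidden.
        For example:
        optimizerD = torch.optim.Adam(model.parameters(), lr=5e-5), where D.parameters() was mistyped as model.parameters().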
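A cheap guard against this kind of slip is to check, once at start-up, that each optimizer holds exactly the parameters of the network it is meant to update. A minimal sketch with a hypothetical helper `optimizer_covers` and two linear layers standing in for the generator and discriminator:

```python
import torch
import torch.nn as nn

def optimizer_covers(optimizer: torch.optim.Optimizer, module: nn.Module) -> bool:
    """True if the optimizer holds exactly the parameters of `module`."""
    opt_ids = {id(p) for group in optimizer.param_groups for p in group["params"]}
    mod_ids = {id(p) for p in module.parameters()}
    return opt_ids == mod_ids

G = nn.Linear(4, 4)   # stand-in generator
D = nn.Linear(4, 1)   # stand-in discriminator

wrong_optimizerD = torch.optim.Adam(G.parameters(), lr=5e-5)   # the copy-paste bug
right_optimizerD = torch.optim.Adam(D.parameters(), lr=5e-5)

print(optimizer_covers(wrong_optimizerD, D))
print(optimizer_covers(right_optimizerD, D))
```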

Random seed pitfalls across worker processes in Python, NumPy and PyTorch

        Affected: the random number generators of random, numpy.random and torch.
        Reference 1
        Reference 2
        Fixed officially in PyTorch >= 1.9.
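For versions where the fix is not available, or when explicit control is wanted, each DataLoader worker can be seeded by hand. The sketch below follows the `worker_init_fn` pattern from the PyTorch reproducibility notes; the toy dataset and the seed value 0 are placeholders.

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id: int) -> None:
    # Inside a worker, torch.initial_seed() already differs per worker;
    # derive the NumPy and random seeds from it.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

# Workers are only started when the loader is iterated.
loader = DataLoader(
    TensorDataset(torch.arange(8)),
    batch_size=2,
    shuffle=True,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,
)
```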

Rounding rule of torch.round

        Same behavior as OpenGL: for x.5, round up when x is odd and down when x is even (round half to even).
        https://github.com/Microsoft/DirectXShaderCompiler/issues/1671
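A quick check of the rule on exact halves:

```python
import torch

halves = torch.tensor([0.5, 1.5, 2.5, 3.5, -1.5])
rounded = torch.round(halves)
print(rounded)

# Python's built-in round() follows the same round-half-to-even rule.
py_rounded = [round(v) for v in [0.5, 1.5, 2.5, 3.5]]
print(py_rounded)
```

If round-half-up is needed for non-negative values, torch.floor(x + 0.5) is a common substitute.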

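Adding module search paths for projects with a complex file layout

        Purpose: avoid ModuleNotFoundError.
        sys.path is a list of the directories Python searches when importing modules.
        To add your own module search directory, use the list's append() method:
        sys.path.append(path)  # absolute or relative path
        Note 1: this is a run-time modification; it is gone once the script exits.
        Note 2: an entry should be a directory, not a .py file.
        Note 3: to control search priority, use for example sys.path.insert(1, "./model").

        import sys
        from pprint import pprint
        pprint(sys.path)  # pprint is only used here to make the output look nice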
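Relative entries in sys.path depend on the current working directory, so resolving them to absolute paths first is usually more robust. A minimal sketch with a hypothetical helper `add_search_dir`; "./model" is the example directory from the note above.

```python
import sys
from pathlib import Path

def add_search_dir(directory, priority=False):
    """Add a directory to sys.path once; priority=True places it near the front."""
    path = str(Path(directory).resolve())   # absolute, independent of later chdir
    if path in sys.path:
        return path                         # avoid duplicate entries
    if priority:
        sys.path.insert(1, path)
    else:
        sys.path.append(path)
    return path

model_dir = add_search_dir("./model")
print(model_dir in sys.path)
```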

RNN dataloader

        To align videos of different lengths for an RNN/LSTM, there is a built-in padding utility:
        >>> x1 = torch.randn(90, 73)
        >>> x2 = torch.randn(90, 73)
        >>> x3 = torch.randn(87, 73)
        >>> x4 = torch.randn(92, 73)
        >>> y = [x1, x2, x3, x4]
        >>> import torch.nn.utils.rnn as rnn_utils
        >>> y = rnn_utils.pad_sequence(y, batch_first=True)
        >>> y.shape  # y is now a tensor of shape [4, 92, 73]
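Padding alone makes the LSTM process the zero frames as well. Packing the padded batch with the true lengths avoids that. A minimal sketch, with the hidden size of 16 chosen arbitrarily:

```python
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils

seqs = [torch.randn(n, 73) for n in (90, 90, 87, 92)]
lengths = torch.tensor([s.shape[0] for s in seqs])   # true lengths, kept on CPU

padded = rnn_utils.pad_sequence(seqs, batch_first=True)   # (4, 92, 73)
packed = rnn_utils.pack_padded_sequence(
    padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=73, hidden_size=16, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Back to a padded tensor; positions past each sequence's length are zeros.
out, out_lengths = rnn_utils.pad_packed_sequence(packed_out, batch_first=True)
print(padded.shape, out.shape, out_lengths)
```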

Poor results from image-processing libraries

        https://github.com/assafshocher/ResizeRight
        A line from the referenced tweet:
        Yeah, my advice to all my students is to never trust any image processing these libraries do. And always, ALWAYS verify visually that you get expected behavior, no matter how “trivial” the operation

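Decorators

        Background for the registry: decorators
        A way of adding functionality dynamically while the code is running is called a "decorator".

        Why use a decorator
        Suppose there is an addition function add:
        def add(a, b):
            print(a + b)
        Now we are asked to measure how long the function takes.

        Approach 1: modify the original function.
        import time

        def add(a, b):
            start = time.time()
            print(a + b)
            time.sleep(2)  # simulate a slow operation
            elapsed = time.time() - start
            print(f'Took {elapsed} seconds.')
        Drawback of approach 1: it increases coupling and makes extension and reuse hard. If we later add logging as well and want timing for every function in the program, the amount of work becomes daunting.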

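        Approach 2:
        def timer(func, *args):
            start = time.time()
            func(*args)
            time.sleep(2)  # simulate a slow operation
            elapsed = time.time() - start
            print(f'Took {elapsed} seconds.')

        timer(add, 1, 2)
        Drawback of approach 2: the original function is unchanged, but the way it is called has changed, so every call site of add must be modified.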

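        Approach 3:
        Use a decorator: neither the original function nor the call sites need to change.
        Decorating add() with @timer:
        def timer(func):
            def wrapper(*args, **kwargs):
                start = time.time()
                result = func(*args, **kwargs)  # func is the decorated function
                time.sleep(2)  # simulate a slow operation
                elapsed = time.time() - start
                print(f'Took {elapsed} seconds.')
                return result  # pass the return value through
            return wrapper  # return a reference to the inner function

        @timer
        def add(a, b):
            print(a+b)

        add(1, 2)  # call add as usual
        Placing @timer at the definition of add() is equivalent to executing add = timer(add).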

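        What is a decorator
        First, a few concepts:
        Higher-order function: a function that takes a function as an argument or returns a function as its result.
        Closure: a function with an extended scope that includes non-global variables referenced in its body but not defined there. Put simply, a nested function that refers to variables of its enclosing function.

        A decorator is essentially a higher-order function that returns a function; it lets other functions gain functionality without being modified. That is the meaning of "decoration": the decorator stands for one piece of functionality, and applying it to different functions adds that functionality to each of them.
        In general, we use the @ syntactic sugar to decorate other functions or objects.

        From loading to execution:
        module is loaded ->> @ is encountered, timer is executed with add passed in ->> timer.<locals>.wrapper is created and bound to the name add, replacing the original function of that name ->> add(1, 2) is called ->> timer.<locals>.wrapper(1, 2) runs ->> wrapper holds a reference to the original add (func) and calls func(1, 2) ->> the rest of wrapper runs

        Further questions
        - With several decorators, in what order do they run?
        - How do you write a decorator that takes arguments?
        - Class decorators, and class decorators without arguments?
        - Is the decorated function still the original function? How can it pass as the original and keep its attributes?
        Reference: https://blog.csdn.net/u011331397/article/details/113481370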
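Three of these questions can be answered with one short sketch: a decorator with arguments is a function that returns a decorator, stacked decorators are applied bottom-up and run top-down at call time, and functools.wraps copies the original function's name and docstring onto the wrapper. The names `timer`, `tag` and `calls` are illustrative only; class decorators are not covered here.

```python
import functools
import time

def timer(label="elapsed"):
    """Decorator factory: @timer("name") returns the actual decorator."""
    def decorator(func):
        @functools.wraps(func)           # keep func.__name__, __doc__, ...
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            print(f"[{label}] {func.__name__}: {time.perf_counter() - start:.6f} s")
            return result                # pass the return value through
        return wrapper
    return decorator

calls = []

def tag(name):
    """Record the order in which stacked wrappers run."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            calls.append(name)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@tag("outer")
@tag("inner")
@timer("demo")
def add(a, b):
    """Return a + b."""
    return a + b

print(add(1, 2), add.__name__, calls)
```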

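Registry

        What is registration:
        A registry can be seen as a mapping from strings to classes (for example, model name -> model class). The classes in one registry usually share a similar API but implement different algorithms, for example the backbone networks in object detection.

      Modules to register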
      models/model.py:
      class Model:
          pass

      @Registers.model.register
      class Model1(Model):
          pass


      @Registers.model.register
      class Model2(Model):
          pass


      @Registers.model.register
      class Model3(Model):
          pass

      The registry class Register
      import logging

      class Register:

          def __init__(self, registry_name):
              self._dict = {}
              self._name = registry_name

          def __setitem__(self, key, value):
              if not callable(value):
                  raise Exception(f"Value of a Registry must be a callable!\nValue: {value}")
              if key is None:
                  key = value.__name__
              if key in self._dict:
                  logging.warning("Key %s already in registry %s." % (key, self._name))
              self._dict[key] = value

          def register(self, target):
              """Decorator to register a function or class."""

              def add(key, value):
                  self[key] = value
                  return value

              if callable(target):
                  # @reg.register
                  return add(None, target)
              # @reg.register('alias')
              return lambda x: add(target, x)

          def __getitem__(self, key):
              return self._dict[key]

          def __contains__(self, key):
              return key in self._dict

          def keys(self):
              """Return the registered keys."""
              return self._dict.keys()

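        One more note: @ is Python's syntactic sugar for decorators.
        @decorate
        def func():
            ...
        is equivalent to
        func = decorate(func)

        The key is the register method, which can be used as a decorator to register a function or a class. For example:
        @register_obj.register("model_one")
        class Model1:
            ...
        ends up executing add("model_one", Model1). See the reference link.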

        1. Register into a Register.
        Register_func = Register("Register_func")

        @Register_func.register
        def add(x,y):
            return x+y

        @Register_func.register
        def minus(x,y):
            return x-y
        # Here register acts as a decorator; it adds {minus.__name__: minus} to the dictionary.

        @Register_func.register
        def multi(x,y):
            return x*y

        @Register_func.register
        def div(x,y):
            return x/y
        2. Use the registered modules.
        operation = Register_func["add"]
        result = operation(1,2)
        print(result)
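The practical payoff of the string-to-class mapping is that a model can be chosen from a config entry instead of an if/else chain. The sketch below is self-contained, so it uses a reduced stand-in for the Register class; `MODELS`, `SmallNet`, `BigNet` and `build_model` are hypothetical names for illustration.

```python
class Register:
    """Reduced stand-in for the Register class (plain @register form only)."""
    def __init__(self, name):
        self._name = name
        self._dict = {}

    def register(self, target):
        self._dict[target.__name__] = target
        return target

    def __getitem__(self, key):
        return self._dict[key]

    def keys(self):
        return self._dict.keys()

MODELS = Register("models")

@MODELS.register
class SmallNet:
    def __init__(self, width=16):
        self.width = width

@MODELS.register
class BigNet:
    def __init__(self, width=256):
        self.width = width

def build_model(config):
    """Instantiate the class named by config['name']; other keys become kwargs."""
    kwargs = dict(config)
    name = kwargs.pop("name")
    if name not in MODELS.keys():
        raise KeyError(f"unknown model {name!r}; available: {sorted(MODELS.keys())}")
    return MODELS[name](**kwargs)

model = build_model({"name": "SmallNet", "width": 32})
print(type(model).__name__, model.width)
```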

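Lambda expressions

        Functional programming (FP) has its roots in mathematics, and it seems best suited to computation-heavy scenarios.
        Many people talk about functional programming, but from different angles they see very different landscapes. Pragmatic Python veterans should take a more open attitude towards FP: they do not believe in silver bullets, yet they sense that FP agrees with parts of the Zen of Python, and since Python is a multi-paradigm language with considerable support for functional programming, there is even less reason to reject it.

        map(): applies a function to each element of a sequence; in Python 3 it returns an iterator (wrap it in list() to get a list)
        filter(): keeps the elements for which the function returns true; also returns an iterator
        reduce(): combines the elements with a binary function into a single result; in Python 3 it lives in functools

        These basic data-processing functions take two arguments: first a function, then a sequence (list, tuple, dict).

        lambda is a convenient piece of Python syntax for quickly defining a minimal, single-expression anonymous function.
        Combined with the functions above, it is both efficient and concise:
        from functools import reduce

        li = [1, 2, 3, 4, 5]
        # add 1 to every element
        list(map(lambda x: x+1, li)) # [2, 3, 4, 5, 6]

        # keep the even numbers
        list(filter(lambda x: x % 2 == 0, li)) # [2, 4]

        # product of all elements
        reduce(lambda x, y: x * y, li) # 1*2*3*4*5 = 120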