Article: The Winning Solution of the CVPR 2020 Fine-Grained Classification Challenge
The latest competition is on fine-grained classification. 天涯明月刀 previously shared the winning solution of the CVPR 2019 fine-grained classification challenge; here I walk through the winning solution of the CVPR 2020 challenge. Studying the experience of strong teams is always worthwhile. This year's champion is the Alipay 天筭 Security Lab (支付宝天筭安全实验室).

Task
Distinguish different diseases of apple trees from photographs of their leaves. Better disease classification reduces the overuse of agricultural chemicals and, in turn, the emergence of drug-resistant pathogen strains.

Dataset
The dataset contains 1821 training images and 1821 test images. Each image carries one of four possible labels (healthy, rust, scab, or both diseases at once), in a ratio of roughly 6:6:6:1, and part of the labels are inaccurate. The dataset therefore has four salient characteristics: (1) little data, (2) multiple labels, (3) class imbalance, and (4) noisy labels. The small data volume is the main factor limiting accuracy; the multi-label and imbalance issues matter less, but the noisy labels must be dealt with explicitly, otherwise they hurt accuracy considerably.

Evaluation metric
ROC/AUC, averaged over the labels. ROC/AUC is a common choice: when the positive/negative distribution of the test set shifts, the ROC curve stays essentially unchanged, which mitigates the problem that plain accuracy cannot judge a model under class imbalance. The figure in the original post compares ROC with Precision-Recall curves: (a) and (c) are ROC curves, (b) and (d) are Precision-Recall curves; (a)/(b) are drawn on a balanced test set, (c)/(d) on an imbalanced one (negatives increased tenfold). The ROC curves keep essentially the same shape in both settings, while the Precision-Recall curves change dramatically.

The winning solution

0. Participation. 1317 teams entered. The champion's final score was 98.445 against 98.182 for the runner-up, a fairly large gap for a competition of this scale.

1. Data augmentation. Random brightness and random contrast, one of the two applied each time; median filter, motion blur, and Gaussian blur, one of the three applied each time; vertical and horizontal flips; random rotation, scaling, and shearing. (The original post shows example images for each.) These are all common augmentations; CutMix, Mixup, and Cutout were not used.

2. Model. A single SEResNeXt50, apparently without model ensembling. The SE branch of SENet acts like an attention mechanism and captures the fine-grained features that separate similar classes, which suits fine-grained classification. (The original post shows the basic SENet block.)

3. Training strategy. Adam optimizer with a cyclic learning rate. This combination is easy to tune, needs no overly fine hyper-parameter search, and does not overfit severely, a valuable property given the tiny dataset with some wrong labels.

4. Extra step: self-distillation, to cope with the noisy labels. The pipeline: first train model 1 with 5-fold cross-validation and keep the validation-fold predictions of every fold; then mix these predictions with the original labels at a ratio of ground truth : prediction = 3 : 7; finally train model 2 on the mixed labels to obtain the final result. A minimal sketch of the label mixing appears below.

5. Model inspection and analysis. During training, CAM (class activation mapping) is used to visualize what the convolutional layers attend to. Since the classes are extremely similar, a well-trained model's CAM should clearly highlight the lesion regions of the leaf. The champion team's CAMs show exactly that: the network focuses on the regions that discriminate between classes.

6. Inference. TTA, with the final result obtained by averaging the predictions.

All in all, the solution uses only common models and tricks, yet it is very effective. That is Alipay's model-tuning team for you.
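As a rough illustration of step 4's label mixing (my own sketch, not the team's actual code; the class order and values are hypothetical):

import numpy as np

def mix_labels(ground_truth, oof_prediction, gt_weight=0.3):
    # Blend labels at ground truth : prediction = 3 : 7, as described above.
    return gt_weight * ground_truth + (1.0 - gt_weight) * oof_prediction

gt = np.array([0.0, 1.0, 0.0, 0.0])        # original (possibly noisy) one-hot label
pred = np.array([0.10, 0.55, 0.30, 0.05])  # model 1's out-of-fold prediction
soft_label = mix_labels(gt, pred)          # training target for model 2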


Image Classification

Koolo233 · 2020-11-23 17:09
Article: A Summary of the DeepLab Series of Semantic Segmentation Papers
DeepLab series: problems still present in segmentation
1. Progressively reduced resolution: consecutive pooling or downsampling sharply lowers the image resolution, losing information from the original signal that is hard to recover during upsampling. More and more networks therefore try to reduce the resolution loss, for example with atrous (dilated) convolutions, or by replacing pooling with stride-2 convolutions; experiments show such methods are effective.
2. Objects exist at multiple scales and cannot be segmented well.
3. Localization accuracy is low, and object boundaries are not delineated finely enough.

DeepLab v1

Abstract
Background: the last layer of a DCNN is not sufficient for precise segmentation of targets. Contribution: the paper combines a deep convolutional network with a CRF (conditional random field) to overcome the poor localization of deep networks, and innovatively applies the "hole" algorithm (atrous convolution) to DCNNs. Result: the network surpassed the accuracy of previous methods.

Introduction
DCNNs face two technical obstacles in image labeling tasks: (1) signal downsampling and (2) spatial insensitivity. The first concerns the resolution loss caused by repeated max-pooling and downsampling inside the DCNN, which discards detail; DeepLab adopts the atrous ("with holes") algorithm to enlarge the receptive field and gather more context. The second concerns the fact that obtaining object-centered classification decisions requires invariance to spatial transformations, which naturally limits the localization accuracy of a DCNN; DeepLab adopts a fully connected conditional random field (DenseCRF) to improve the model's ability to capture detail.

Main contributions:
- Speed: the DCNN with the atrous algorithm runs at 8 FPS; fully connected CRF inference takes about 0.5 s on average.
- Accuracy: second place on the PASCAL semantic segmentation challenge.
- Simplicity: DeepLab is a cascade of two very mature modules (a DCNN and a CRF).

Network structure (DCNN)
To load VGG-16 weights pre-trained on ImageNet while shrinking the image by only a factor of 8, the authors make the following changes:
- turn the fully connected layers (fc6, fc7, fc8) into convolutions (for dense prediction);
- change the stride of the last two pooling layers (pool4, pool5) from 2 to 1, with padding back to the original size, to preserve feature resolution;
- set the dilation rate of the last three convolutions (conv5_1, conv5_2, conv5_3) to 2, and of the first fully connected layer to 4, to preserve the receptive field;
- change the channel count of the last layer fc8 from 1000 to 21 (the number of classes).
To speed up training, the first fully connected layer fc6 is reduced from 4096 to 1024 channels and its kernel from 7x7 to 3x3; later experiments found that a dilation rate of 12 here (LargeFOV) works best.

Atrous convolution
(Figure in the original post: the blue part is the input, a 7x7 image; the cyan part is the output, a 3x3 image; the kernel is 3x3 with sampling (dilation) rate 2 and no padding.) This sampling with holes, known as the atrous algorithm, samples the underlying feature map sparsely. The method is general and can compute dense feature maps at any sampling rate. Using atrous convolutions with different rates inside VGG-16 lets the model explicitly control the network's receptive field during dense computation, ensuring the DCNN's prediction map reliably predicts the positions of objects in the image.

Training fine-tunes the pre-trained VGG-16 weights; the loss is the sum of cross-entropies between the output feature map and the ground truth downsampled by 8. At test time, the output map is bilinearly upsampled by 8 to obtain the result. The DCNN's predicted object positions are coarse, however, without exact contours: the repeated combinations of max-pooling and downsampling give the model translation invariance, which makes localization from its high-level outputs difficult. There is a natural trade-off between classification accuracy and localization accuracy.

CRFs in semantic segmentation
Each pixel position i has a hidden variable xᵢ (the pixel's true class label; with 21 predicted classes, xᵢ ∈ {1, 2, ..., 21}) and a corresponding observation yᵢ (the pixel's color value). Taking pixels as nodes and the relations between pixels as edges forms a conditional random field (CRF); the class label xᵢ at position i is inferred from the observation yᵢ. (The original post shows a CRF diagram here.)

Fully connected CRF
A CNN extracts features progressively, so original position information fades or disappears as the network deepens. In traditional image processing, a CRF performs smoothing: when deciding a pixel's value (here, its label), it considers the labels of surrounding pixels. But the probability maps produced by a CNN are already fairly smooth, so a short-range CRF adds little. Instead, a fully connected CRF is used: it incorporates global information and recovers detailed local structure, such as precise object contours. CRFs can raise accuracy in almost any segmentation task.

The CRF is a post-processing step and does not participate in training. At test time, the score map produced by feature extraction is bilinearly interpolated back to the original image size and then refined by the CRF; since the map is only 8x smaller, direct enlargement is acceptable (at 32x, learned upsampling, i.e. deconvolution, would be needed).

(Figure in the original post: the first column is the input image and ground truth; the second column is the DCNN output, score map above and belief map below; the output of the last DCNN layer is the CRF's input; the next three columns show the score and belief maps after 1, 2, and 10 CRF iterations.) The CRF consumes the post-softmax probability map; see the paper for the CRF details (they are fairly involved).

Multi-scale structure (MLP)
The authors also fuse multi-level network information: each of the first four pooling layers is followed by two extra convolutions (128 channels 3x3, then 128 channels 1x1), and these are concatenated with the final layer, adding 5 x 128 = 640 channels before the softmax. When training these branches, the rest of the network's parameters are frozen. Experiments show that multi-scale fusion improves the predictions, though less markedly than the CRF.

Experiments
Settings:
- Dataset: PASCAL VOC 2012 segmentation benchmark
- DCNN model: VGG-16 with pre-trained weights
- DCNN loss: cross-entropy
- Optimizer: SGD, batch size 20
- Learning rate: 0.001 initially (0.01 for the final classifier layer), multiplied by 0.1 every 2000 iterations
- Momentum 0.9, weight decay 0.0005

DeepLab consists of the DCNN and the CRF, trained piecewise: the DCNN output serves as the CRF's unary potential and is held fixed while training the CRF; after fine-tuning the DCNN, the CRF parameters are chosen by cross-validation (see the paper for the exact values). On the validation set, the DeepLab variants with CRF and multi-scale (MSc) clearly improve the results.

DeepLab v2

Abstract
Contributions: atrous convolution is exploited fully; it enlarges the receptive field effectively without adding parameters and merges more context, and the DCNN + CRF combination further refines the output. ASPP (Atrous Spatial Pyramid Pooling) is proposed: it probes the incoming convolutional feature layer with filters at multiple sampling rates and receptive fields, capturing objects and image context at multiple scales, and a graphical model (fully connected Conditional Random Field, DenseCRF) then sharpens object boundaries. Effect: ASPP strengthens the network's robustness for multi-scale, multi-class segmentation by extracting features with different sampling rates and receptive fields. Result: 79.7% mIoU on PASCAL VOC 2012, with thorough experiments on other datasets as well.

Problems faced when using deep CNNs for semantic segmentation, and the corresponding solutions:
1. Shrunken feature maps: the last few downsampling pooling layers are removed to enlarge the convolutional features, and atrous convolution increases the feature map's receptive field without adding parameters or computation.
2. Objects of multiple sizes: the paper borrows the idea of spatial pyramid pooling, building a "spatial pyramid" purely from atrous convolutions at different rates (Atrous Spatial Pyramid Pooling, ASPP).
3. The translation invariance of deep CNNs harms segmentation accuracy: a fully connected conditional random field (DenseCRF) makes boundary localization more precise.

Network design
1. Network structure. (Figure in the original post: the overall pipeline is close to v1, with improvements to the convolutional network, feature pooling, and CRF.)
2. Atrous convolution (as in v1).
3. ASPP: extracting features from a multi-scale perspective. To accommodate targets of different sizes, the paper considers two options: (1) resize the input image to build an image pyramid, run the same CNN on each scale to generate features, restore these features to the original resolution, and fuse them by element-wise maximum; this improves segmentation but brings heavy computational cost; (2) run atrous convolutions with different rates in parallel on one fixed feature map. Concretely, before the final pixel classification an Inception-like structure containing atrous convolutions at different rates is added: four parallel 3x3 branches with dilation rates 6, 12, 18, and 24; the four resulting feature maps are summed, followed by 8x bilinear upsampling. Besides enlarging the receptive field as proposed in DeepLab v1, and more importantly, this strengthens the model's grasp of the semantics of the same object at different sizes. A minimal sketch of such a block follows.
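Below is a minimal PyTorch sketch of the parallel-rate block just described; it is my own illustration rather than the paper's code, and the channel counts are made up:

import torch
from torch import nn

class ASPPSum(nn.Module):
    def __init__(self, in_channels, out_channels, rates=(6, 12, 18, 24)):
        super().__init__()
        # One 3x3 atrous branch per rate; padding = dilation keeps the spatial size.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=r, dilation=r)
             for r in rates]
        )

    def forward(self, x):
        # Sum the four parallel feature maps, as in the DeepLab v2 description.
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)

features = torch.randn(1, 512, 28, 28)  # hypothetical backbone feature map
logits = ASPPSum(512, 21)(features)     # 21 classes for PASCAL VOC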
4. Fully connected CRF. The raw CNN output (second column of Figure 5 in the paper) is very coarse: the CNN's translation invariance lowers boundary localization accuracy, and the CRF is introduced to solve this.

Changes from DeepLab v1:
- ASPP is introduced and handles multi-scale objects better (similar in spirit to SPP).
- The backbone changes from VGG-16 to ResNet, bringing a further gain.

Experimental tricks used by the authors:
(1) Data augmentation: following "Semantic contours from inverse detectors", the VOC 2012 training set is enlarged from 1464 to 10582 images.
(2) The initial learning rate, momentum, and CRF parameters are chosen as in DeepLab v1, and the FOV (field of view) is controlled through the dilation rate. On DeepLab-LargeFOV the authors tried different learning-rate decay policies: compared with the step policy, the poly policy reaches 65.88% mIoU on the VOC 2012 validation set (without CRF post-processing) at batch size 30 and 20000 iterations.
(3) LargeFOV, mentioned in (2), means atrous convolution with dilation rate r = 12: the authors apply it at fc6 of VGG-16 and turn fc7 and fc8 into 1x1 fully convolutional layers, naming the result DeepLab-LargeFOV. As a comparison, four parallel branches with different dilation rates act on fc6, with fc7 and fc8 likewise fully convolutional, named DeepLab-ASPP; with rates r = {2, 4, 8, 12} it is called DeepLab-ASPP-S, and with r = {6, 12, 18, 24} DeepLab-ASPP-L. (The original post shows the DeepLab-LargeFOV and DeepLab-ASPP diagrams on VGG-16.)

DeepLab v3

Abstract
Contribution: to address segmentation at multiple scales, the paper designs cascaded and parallel atrous convolution modules and extends the ASPP module. (1) The use of atrous convolution is revisited: within both cascaded and spatial-pyramid-pooling frameworks it yields a larger receptive field and thus multi-scale information. (2) ASPP is improved: it is composed of atrous convolutions at different sampling rates together with BN layers, laid out serially or in parallel. (3) A subtle issue is discussed: a 3x3 atrous convolution with a very large sampling rate cannot capture long-range information (small targets) because of image boundary effects and degenerates into a 1x1 convolution, so the paper proposes fusing image-level features into the ASPP module. Effect: the network obtains good results even without DenseCRF post-processing. Result: performance comparable to other models on PASCAL VOC 2012.

Common frameworks for extracting features (capturing context) in semantic segmentation:
1. Image pyramids: feed the image at several scales separately through a network and fuse the resulting features; large-scale inputs provide detail for small objects, while small-scale inputs are encoded into global context. This approach is limited by memory.
2. Encoder-decoder: the encoder extracts features via downsampling; the decoder restores feature-map size and object detail via upsampling.
3. Deeper networks with atrous convolution: classic segmentation algorithms extract features by consecutive downsampling, whereas atrous convolution uses different sampling rates instead.
4. Spatial pyramid pooling: besides ASPP, other networks such as SPPNet and PSPNet use this idea.

Dense features via atrous convolution
Fully convolutional approaches have proven effective for semantic segmentation, but frequent max-pooling and striding severely lower the spatial resolution of the feature maps. Deconvolution (as in FCNs) has been used to recover spatial resolution, but atrous convolution is more effective: its receptive field is larger, and with the number of parameters unchanged, raising the dilation rate increases the overlap between sampled regions of the input feature map, yielding denser feature responses. For example, in an ordinary CNN whose output is 32x smaller than the input (five downsamplings), when we want to double the spatial density of the responses we can set the stride of the last convolution or pooling layer to 1 and replace the subsequent convolutions with rate-2 atrous convolutions, obtaining denser responses without learning more parameters. "Dense" here refers to the feature map's resolution: ordinary downsampling makes the maps ever smaller, so if each feature pixel is mapped back to its position in the input, the response over the whole image becomes sparse; replacing the last resolution-reducing layer with atrous convolution keeps the resolution at the same computational cost (the input/output size ratio becomes 16x instead of 32x), so the responses mapped back to the input are denser.

In a DCNN, atrous convolution can be used across a series of layers, allowing the network's feature responses to be computed at arbitrarily high resolution. Carrying this through the whole network would keep the original resolution throughout, but the computational cost is too high, so one trades efficiency against accuracy and turns the classification network into a dense feature extractor.

Network structure
(Figure in the original post: the standard ResNet blocks keep shrinking the feature maps while getting deeper. For classification this is beneficial, since deeper blocks obtain global information more easily, but for segmentation it loses too much detail and the feature responses become sparse. Converting the last few blocks to atrous convolution barely increases the computation yet keeps the resolution from dropping, giving denser responses and better detail when restoring to the original image size.)

Atrous Spatial Pyramid Pooling (ASPP)
ASPP with different dilation rates captures multi-scale information effectively, but it turns out that as the sampling rate grows, especially when the filter's effective size approaches the feature map's size, the whole 3x3 filter degenerates into a 1x1 filter, with only the central weight effective. To overcome this, the paper applies global average pooling to the last feature map, passes the result through a 1x1 convolution with 256 filters, and upsamples it to the required spatial size. The final ASPP thus contains four parallel operations: one 1x1 convolution and three 3x3 convolutions, all with 256 filters (plus the image-level branch). All branch feature maps are concatenated and passed through another 1x1 convolution; the segmentation map is finally obtained with 16x upsampling.

DeepLab v3+

Abstract
Background: deep networks use spatial pyramid pooling modules or encoder-decoder structures for semantic segmentation. The former probes the input features with filters or pooling operations at multiple resolutions and receptive fields, encoding multi-scale context; the latter captures sharper object boundaries by gradually recovering spatial information. Contributions: the paper combines the advantages of both by adding a simple, effective decoder module that refines the segmentation, especially along object boundaries. (1) An encoder-decoder structure is proposed, with DeepLab v3 as the encoder and an added decoder forming the new model (DeepLab v3+). (2) An Xception model is adapted to the segmentation task, and depthwise separable convolutions are applied in the ASPP and decoder modules, producing a faster and stronger encoder-decoder network. Results: the network surpasses previous methods in accuracy and segments boundaries better, reaching 89% mIoU on PASCAL VOC 2012 and 82.1% on Cityscapes.

Introduction
Two types of networks are considered for semantic segmentation: spatial pyramid pooling, which captures rich context by pooling features at different resolutions, and encoder-decoder structures, which obtain finer object boundaries. (1) In DeepLab v3 and PSPNet, rich semantic information is encoded, but the repeated downsampling in the backbone loses the boundary information of the segmented objects, blurring the edges. (2) Atrous convolution can reduce the downsampling and alleviate this loss of boundary information, but it tends to use a lot of memory and computation, especially when the output resolution is 1/8 or 1/4 of the input. (3) Encoder-decoder structures need no atrous convolution, so their computation is usually lower. The paper combines the strengths of the two approaches, enriching the encoder module of the encoder-decoder network with multi-scale context.

Network structure
Atrous convolution here serves to (1) control the resolution of the features computed in the forward pass and (2) adjust the receptive field so that multi-scale information can be captured.

Depthwise separable convolution: its structure is a depthwise convolution followed by a 1x1 pointwise convolution, which greatly reduces computational complexity; as a refinement, atrous convolution is combined with the depthwise convolution, forming the atrous separable convolution, sketched below.
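A minimal sketch of an atrous separable convolution (my own illustration, not the paper's exact implementation):

import torch
from torch import nn

class AtrousSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, dilation=1):
        super().__init__()
        # Depthwise 3x3 (optionally atrous): one filter per input channel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3,
            padding=dilation, dilation=dilation, groups=in_channels,
        )
        # Pointwise 1x1 mixes channels; together far cheaper than a dense 3x3.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 256, 32, 32)
y = AtrousSeparableConv(256, 256, dilation=2)(x)  # same spatial size as the input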
DeepLab v3 as the encoder: by changing the striding of the last one (or two) modules of the DeepLab v3 encoder to atrous convolution, the output stride can be controlled at 16 (or 8). DeepLab v3's improved ASPP module provides several parallel atrous branches at different rates plus the image-level features.

Decoder: the encoder output features are enlarged 4x with bilinear interpolation, then concatenated with low-level features from the backbone; a 1x1 convolution first compresses the low-level feature channels, which makes training easier and prevents the encoder output from being swamped by the low-level backbone features. (A minimal sketch of this decoder path appears at the end of this article.)

Modified Aligned Xception: a deeper Xception structure is used while keeping the original entry-flow network; all max-pooling layers are replaced by strided separable convolutions, so features can be extracted from feature maps at arbitrary resolution; BN and ReLU layers are added after every 3x3 convolution.

Experiments
Datasets: PASCAL VOC 2012, Cityscapes. Metrics: mIoU, Multiply-Adds.

Results:
1. Effect of the number of output channels of the 1x1 convolution applied to the low-level features (ResNet-101 backbone).
2. Effect of the structure of the decoder's 3x3 convolutions (ResNet-101 backbone).
3. Effect of different inference strategies (ResNet-101 backbone): Table 3 shows that at output stride 8 the computation nearly triples while the mIoU gain is very small, so output stride 8 is not worth the cost.
4. Effect of different inference strategies for DeepLab v3+ with the modified Aligned Xception backbone: compared with Table 3, Aligned Xception indeed beats ResNet-101 for semantic segmentation, with better results at lower computation.
5. Results on the PASCAL VOC 2012 test set.
6. Results on Cityscapes (Xception backbone).

Typical experiment suites:
(1) model variants (different main paths, different modules);
(2) learning-rate experiments;
(3) different datasets (use several);
(4) comparison with previous methods.
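To make the decoder path described above concrete, here is a minimal sketch of a DeepLab v3+-style decoder; it is my own illustration, and the channel counts are illustrative rather than the paper's exact values:

import torch
from torch import nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    def __init__(self, encoder_channels=256, low_level_channels=256, num_classes=21):
        super().__init__()
        # 1x1 conv compresses the low-level feature channels, preventing them
        # from swamping the encoder output after concatenation.
        self.compress = nn.Conv2d(low_level_channels, 48, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv2d(encoder_channels + 48, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, encoder_out, low_level):
        # 4x bilinear upsampling of the encoder output to the low-level feature size.
        up = F.interpolate(encoder_out, size=low_level.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.refine(torch.cat([up, self.compress(low_level)], dim=1))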


Semantic Segmentation

soaring flight ゼ · 2020-11-23 15:12
Article: Building a Twitter Sentiment Analysis Model
FlyAI translated article. English original: How to Build a Twitter Sentiment Analysis System. Tags: sentiment classification

In the field of social media data analytics, one popular area of research is the sentiment analysis of twitter data. Twitter is one of the most popular social media platforms in the world, with 330 million monthly active users and 500 million tweets sent each day. By carefully analyzing the sentiment of these tweets—whether they are positive, negative, or neutral, for example—we can learn a lot about how people feel about certain topics.

Understanding the sentiment of tweets is important for a variety of reasons: business marketing, politics, public behavior analysis, and information gathering are just a few examples. Sentiment analysis of twitter data can help marketers understand the customer response to product launches and marketing campaigns, and it can also help political parties understand the public response to policy changes or announcements.

However, Twitter data analysis is no simple task. There are something like ~6000 tweets released every second. That's a lot of Twitter data! And though it's easy for humans to interpret the sentiment of a tweet, human sentiment analysis is simply not scalable.

In this article, we're going to look at building a scalable system for Twitter sentiment analysis, to help us better understand the role of machine learning in social media data analytics.

Problem: Identifying Negative Sentiment in Tweets
In this article, we'll learn how to identify tweets with a negative sentiment. To do so, we'll create a sentiment analyzer to classify positive and negative tweets in text format. Though we'll be using our classifier for Twitter data analysis, it can also be used to analyze text data from other sources as well.

Through the course of the article, we are going to look at datasets, various text processing and embedding techniques, and then employ a machine learning model to process our data:
1. Twitter Sentiment Analysis Dataset
2. Text Processing: A. Cleaning of raw text, B. Tokenization, C. Stemming
3. Word Embedding Techniques: A. Bag of Words, B. Term Frequency — Inverse Document Frequency, C. Word2Vec
4. Model
5. Performance Metrics
6. Results
7. Summary

Twitter Sentiment Analysis Dataset
Let's start with our Twitter data. We will use the open-source Twitter Tweets Data for Sentiment Analysis dataset. It contains 32,000 tweets, of which 2,000 contain negative sentiment. The target variable for this dataset is 'label', which maps negative tweets to 1, and anything else to 0. Think of the target variable as what you're trying to predict. For our machine learning problem, we'll train a classification model on this data so it can predict the class of any new tweets we give it. (A snapshot of the data is shown in the original post.)
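The snippets below assume the dataset has been loaded into a pandas DataFrame named tweets_data with 'tweet' and 'label' columns. A minimal, hypothetical loading step (the file name is illustrative):

import pandas as pd

# Hypothetical file name; use the CSV provided with the dataset.
tweets_data = pd.read_csv("twitter_sentiment.csv")
print(tweets_data.shape)  # roughly 32,000 rows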
Text Processing
Data usually comes from a variety of different sources and is often in a variety of different formats. For this reason, cleaning your raw data is an essential part of preparing your dataset. However, cleaning is not a simple process, as text data often contain redundant and/or repetitive words. This is especially true in Twitter sentiment analysis, so processing our text data is the first step towards our solution.

The fundamental steps involved in text processing are:
A. Cleaning of Raw Data
B. Tokenization
C. Stemming

A. Cleaning of Raw Data
This phase involves the deletion of words or characters that do not add value to the meaning of the text. Some of the standard cleaning steps are below:
- Lowering case
- Removal of mentions
- Removal of special characters
- Removal of stopwords
- Removal of hyperlinks
- Removal of numbers
- Removal of whitespaces

Lowering Case
Lowering the case of text is essential for the following reasons: the words 'Tweet', 'TWEET', and 'tweet' all add the same value to a sentence, and lowering the case of all the words helps to reduce the dimensions by decreasing the size of the vocabulary.

def to_lower(word):
    result = word.lower()
    return result

Removal of mentions
Mentions are very common in tweets. However, as they don't add value for interpreting the sentiment of a tweet, we can remove them. Mentions always come in the form of '@mention', so we can remove strings that start with '@'. To achieve this on the entire dataset, we use the function below:

import re

def remove_mentions(word):
    # Strip '@mention' tokens, which always start with '@'.
    return re.sub(r"@\w+", "", word)

Removal of special characters
This text processing technique will help to treat words like 'hurray' and 'hurray!' in the same way. At this stage, we remove all punctuation marks.

import string

def remove_special_characters(word):
    result = word.translate(str.maketrans(dict.fromkeys(string.punctuation)))
    return result

Removal of stopwords
Stopwords are commonly occurring words in a language, such as 'the', 'a', 'an', 'is' etc. We can remove them here because they won't provide any valuable information for our Twitter data analysis.

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stop_words(words):
    result = [i for i in words if i not in ENGLISH_STOP_WORDS]
    return result

Removal of hyperlinks
Now we can remove URLs from the data. It's not uncommon for tweets to contain URLs, but we won't need to analyze them for our task.

import re

def remove_hyperlink(word):
    return re.sub(r"http\S+", "", word)

B. Tokenization
Tokenization is the process of splitting text into smaller chunks, called tokens. Each token is an input to the machine learning algorithm as a feature.
NLTK (Natural Language Toolkit) provides a utility function for tokenizing data.

from nltk.tokenize import word_tokenize

tweets_data['tweet'] = tweets_data['tweet'].apply(word_tokenize)

C. Stemming
Stemming is the process of removing and replacing suffixes from a token to obtain the root or base form of the word. This is called a 'stem'. For example, the stem for the words 'satisfied', 'satisfaction', and 'satisfying' is 'satisfy', and all of these imply the same feeling. The Porter stemmer is a widely used stemming technique, and nltk.stem provides the utility class PorterStemmer.

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text])

tweets_data['tweet'] = tweets_data['tweet'].apply(lambda text: stem_words(text))

Word Embedding Techniques
There is a huge amount of data in text format. Analyzing text data is an extremely complex task for a machine, as it's difficult for a machine to understand the semantics behind text. At this stage, we're going to process our text data into a machine-understandable format using word embedding.

Word Embedding is simply converting data in a text format to numerical values (or vectors) so we can give these vectors as input to a machine, and analyze the data using the concepts of algebra. However, it's important to note that when we perform this transformation there could be data loss. The key then is to maintain an equilibrium between conversion and retaining data. Two commonly used terminologies for this step: each text data point is called a Document, and an entire set of documents is called a Corpus.

Text processing can be done using the following techniques: Bag of Words, TF-IDF, Word2Vec. Next, let's explore each of the above techniques in more detail, then decide which to use for our Twitter sentiment analysis model.

A. Bag of Words
Bag of Words does a simple transformation of the document to a vector by using a dictionary of unique words. This is done in just two steps, outlined below.

Construction of Dictionary: create a dictionary of all the unique words in the data corpus in a vector form. Let the number of unique words in the corpus be 'd'. Each word is a dimension, so this dictionary vector is a d-dimensional vector.

Construction of Vectors: for each document, say rᵢ, we create a vector, say vᵢ.
This vᵢ, which has d dimensions, can be constructed in two ways:
1. For each document, vᵢ is constructed in accordance with the dictionary such that each word in the dictionary is represented by the number of times that word is present in the document.
2. For each document, vᵢ is constructed in accordance with the dictionary such that each word in the dictionary is represented as 1 if the word exists in the document, or 0 if it doesn't. This type is known as a Binary Bag of Words.

Now we have vectors for each document and a dictionary with a set of unique words from the data corpus. These vectors can be analyzed by plotting them in d-dimensional space, or by calculating distances between vectors to measure similarity (the closer the vectors are, the more similar they are).

B. Term Frequency — Inverse Document Frequency
There are three elements here: word, document, corpus. Term Frequency — Inverse Document Frequency, or TF-IDF for short, uses the relationship between these elements to convert text data into vectors. Term Frequency refers to the relationship between a word and a document, whereas Inverse Document Frequency refers to the relationship between a word and the corpus.

Calculating Term Frequency
Term frequency is the probability of the word wⱼ in the document rᵢ:

TF(wⱼ, rᵢ) = (number of occurrences of wⱼ in rᵢ) / (total number of words in rᵢ)

A high term frequency of a word in a review implies the word is frequent in that review; a low term frequency implies the word is rare in that review.

Calculating IDF
Inverse Document Frequency (IDF) says how frequently a word occurs in the entire corpus:

IDF(wⱼ) = log(N / nⱼ), where N is the number of documents in the corpus and nⱼ is the number of documents containing wⱼ.

A low IDF implies the word is frequent in the corpus; a high IDF implies the word is rare in the corpus. We use a logarithm instead of a simple inverse ratio because of scaling: term frequency is a probability ranging between 0 and 1, while the plain inverse ratio can take values from 0 to infinity and would bias the IDF; using the log is a simple and widely accepted remedy.

TF-IDF of a word in a review = TF(word, review) * IDF(word, corpus). In the vector form of each document, we store the TF-IDF of each word; converting a document into a vector using TF-IDF values is called TF-IDF vectorization. TF-IDF vectorization gives high importance to words which are frequent in a document (from TF) and rare in the corpus (from IDF).
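Although the article ultimately uses Bag of Words, scikit-learn also provides the TF-IDF transformation directly. A minimal sketch, mirroring the CountVectorizer call used later; the parameters are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=2, max_features=100000)
tweets_tfidf = tfidf.fit_transform(tweets_data['tweet'])  # sparse TF-IDF matrix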
C. Word2Vec
In Bag of Words and TF-IDF, we convert sentences into vectors. But in Word2Vec, we convert words into vectors. Hence the name, word2vec! Word2Vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned to a corresponding vector in the space. The positioning of word vectors is done in such a way that words with common contexts in the corpus are located closer in space. For example, the vector from man to woman is parallel to king to queen, etc.
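The article does not train a Word2Vec model, but for illustration here is a minimal sketch using gensim (assuming gensim 4.x and a hypothetical list of token lists named tokenized_tweets):

from gensim.models import Word2Vec

# tokenized_tweets: e.g. [['great', 'day'], ['rainy', 'monday'], ...] (hypothetical)
w2v = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=2)
vector = w2v.wv['day']                        # 100-dimensional word embedding
similar = w2v.wv.most_similar('day', topn=5)  # nearest words in the vector space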
When to use what?
When it comes to which embedding technique to use for a machine learning model, there is no obvious answer: it really depends on the use-case. Bag of Words is commonly used for document classification applications where the occurrence of each word is used as a feature for training a classifier. TF-IDF is used by search engines like Google as a ranking factor for content. Word2vec is great when an application requires a lot of information, like translating documents.

For our Twitter sentiment analysis, we'll use Bag of Words as the word embedding technique. The scikit-learn library provides a CountVectorizer class to perform Bag of Words; using it, we transform our processed data into vectors.

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(min_df=2, max_features=100000)
bow.fit(tweets_data['tweet'])
tweets_processed = bow.transform(tweets_data['tweet']).toarray()

Model Fitting
Logistic Regression is a supervised machine learning classification algorithm widely used by internet applications. It is the simplest algorithm to solve classification problems, but highly efficient. We'll use this to get a probability of sentiment in our Twitter data analysis. Using sklearn.linear_model, we can implement logistic regression. This model outputs the probability of the input belonging to the class, making it possible for us to do sentiment analysis of twitter data on new tweets.

from sklearn.linear_model import LogisticRegression

# tweets_train/tweets_test and target_train/target_test come from a
# train/test split of the vectorized data (omitted in the original post).
model = LogisticRegression()
model.fit(tweets_train, target_train)          # training the model
prediction = model.predict_proba(tweets_test)  # predicting on the test set
prediction_int = prediction[:, 1] >= 0.3       # 1 if probability >= 0.3, else 0
prediction_int = prediction_int.astype(int)

Performance Metrics
Now that we have a twitter sentiment analysis model that can output a probability of a tweet belonging to a particular class, we need some way to judge its performance. Precision and recall are the two most widely used performance metrics for a classification model. Precision is the fraction of the relevant instances from all the retrieved instances; it helps us to understand the usefulness of the results. Recall is the fraction of relevant instances from all the relevant instances; it helps us understand the coverage of the results.

The F1 Score is the harmonic mean of precision and recall. For example, consider that a search query results in 30 pages, of which 20 are relevant, but the results fail to display 40 other relevant results. In this case, the precision is 20/30, and recall is 20/60. Therefore, our F1 Score is 4/9. Using the F1 score as a performance metric for classification problems is a good choice.

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

cf_matrix = confusion_matrix(target_test, prediction_int)
tn, fp, fn, tp = cf_matrix.ravel()
print("Precision: {:.2f}%".format(100 * precision_score(target_test, prediction_int)))
print("Recall: {:.2f}%".format(100 * recall_score(target_test, prediction_int)))
print("F1 Score: {:.2f}%".format(100 * f1_score(target_test, prediction_int)))

import seaborn as sns
import matplotlib.pyplot as plt

ax = plt.subplot()
sns.heatmap(cf_matrix, annot=True, ax=ax, cmap='Blues', fmt='')  # annot=True to annotate cells
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Positive', 'Negative'])
ax.yaxis.set_ticklabels(['Positive', 'Negative'])

Results
A model with an F1 score of 73% is a good-to-go model using traditional machine learning algorithms. However, there are ways to improve the model. We can use deep learning techniques (though these are expensive), and we can respond to results and feedback by adding features and removing misspelled words. Also, keep in mind that these results are based on our training data. When applying a sentiment analysis model to real-world data, we still have to actively monitor the model's performance over time.

Summary: Tips for Twitter Sentiment Analysis
In this article, we learned various text processing and word embedding techniques, and implemented a Twitter sentiment analysis classification model on processed data. Hopefully this will give you an idea of how these social media data analytics systems work, and the sort of work required to prepare and deploy them.

The text processing techniques mentioned in this article are widely performed on text data. However, we don't have to perform all the techniques all the time. It's important to carefully choose the processing and embedding steps based on our use case; this will play an important role in the sentiment analysis data.

In the world of social media data analytics, and especially with twitter data analysis, it's often important to have the support of a domain expert for each step of your process. Vocabulary on social networks is often unique to particular communities, and domain experts can help you to avoid data bias and improve the accuracy of your dataset and analysis. That said, the concepts and techniques learned in this article can be applied to a variety of natural language processing problems.
Outside of Twitter sentiment analysis, you can also use similar techniques for building chatbots, text summarization, spam detection and language translation models.

Thanks for reading! If you would like to experiment with this custom dataset yourself, you can download the data and see the complete code on Github. If you'd like to experiment with other Twitter datasets, here's a repository for a variety of different Twitter content.


Natural Language Processing

AI小助手 · 2020-11-21 14:41
Article: An Open-Source Trove: Example Projects for 27 Deep Reinforcement Learning Algorithms
FlyAI translated article. English original: Deep Reinforcement Learning Nanodegree Algorithms. Tags: reinforcement learning

Here you can find several projects dedicated to Deep Reinforcement Learning methods. The projects are deployed in matrix form: [env x model], where env is the environment to be solved, and model is the model/algorithm which solves this environment. In some cases, the same environment is resolved by several algorithms. All projects are presented as Jupyter notebooks containing a training log.

The following environments are supported: AntBulletEnv, BipedalWalker, CarRacing, CartPole, Crawler, HalfCheetahBulletEnv, HopperBulletEnv, LunarLander, LunarLanderContinuous, Markov Decision 6x6, Minitaur, Minitaur with Duck, Pong, Navigation, Reacher, Snake, Tennis, Walker2DBulletEnv.

Four environments (Navigation, Crawler, Reacher, Tennis) are solved in the framework of the Udacity Deep Reinforcement Learning Nanodegree Program.

Monte-Carlo Methods
In Monte Carlo (MC), we play episodes of the game until we reach the end, grab the rewards collected on the way, and move backward to the start of the episode. We repeat this method a sufficient number of times and average the value of each state.

Temporal Difference Methods and Q-learning; Reinforcement Learning in Continuous Space (Deep Q-Network); Function Approximation and Neural Networks
The Universal Approximation Theorem (UAT) states that feed-forward neural networks containing a single hidden layer with a finite number of nodes can be used to approximate any continuous function, provided rather mild assumptions about the form of the activation function are satisfied.

Policy-Based Methods, Hill-Climbing, Simulated Annealing
Random-restart hill-climbing is a surprisingly effective algorithm in many cases. Simulated annealing is a good probabilistic technique because it does not accidentally mistake a local extremum for a global extremum.

Policy-Gradient Methods, REINFORCE, PPO
Define a performance measure J(θ) to maximize, and learn the policy parameter θ through approximate gradient ascent. A minimal sketch of such an update follows.
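A minimal REINFORCE-style update sketch (my own illustration, not taken from the repository; the policy network, optimizer, and collected trajectory tensors are assumed to exist):

import torch
from torch import nn

def reinforce_step(policy: nn.Module, optimizer, states, actions, returns):
    # Ascend J(theta) = E[ sum_t log pi_theta(a_t | s_t) * G_t ]
    # by minimizing its negative. 'actions' is an int64 tensor of shape [T].
    logits = policy(states)                      # [T, num_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()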
Actor-Critic Methods, A3C, A2C, DDPG, TD3, SAC
The key difference between A3C and A2C is the asynchronous part. A3C consists of multiple independent agents (networks) with their own weights, which interact with different copies of the environment in parallel. Thus, they can explore a bigger part of the state-action space in much less time.

Projects, models and methods
- AntBulletEnv, Soft Actor-Critic (SAC)
- BipedalWalker, Twin Delayed DDPG (TD3)
- BipedalWalker, PPO, Vectorized Environment
- BipedalWalker, Soft Actor-Critic (SAC)
- BipedalWalker, A2C, Vectorized Environment
- CarRacing with PPO, Learning from Raw Pixels
- CartPole, Policy Based Methods, Hill Climbing
- CartPole, Policy Gradient Methods, REINFORCE
- Cartpole, DQN
- Cartpole, Double DQN
- HalfCheetahBulletEnv, Twin Delayed DDPG (TD3)
- HopperBulletEnv, Twin Delayed DDPG (TD3)
- HopperBulletEnv, Soft Actor-Critic (SAC)
- LunarLander-v2, DQN
- LunarLanderContinuous-v2, DDPG
- Markov Decision Process, Monte-Carlo, Gridworld 6x6
- MinitaurBulletEnv, Soft Actor-Critic (SAC)
- MinitaurBulletDuckEnv, Soft Actor-Critic (SAC)
- Pong, Policy Gradient Methods, PPO
- Pong, Policy Gradient Methods, REINFORCE
- Snake, DQN, Pygame
- Udacity Project 1: Navigation, DQN, ReplayBuffer
- Udacity Project 2: Continuous Control-Reacher, DDPG, environment Reacher (Double-Jointed-Arm)
- Udacity Project 2: Continuous Control-Crawler, PPO, environment Crawler
- Udacity Project 3: Collaboration_Competition-Tennis, Multi-agent DDPG, environment Tennis
- Walker2DBulletEnv, Twin Delayed DDPG (TD3)
- Walker2DBulletEnv, Soft Actor-Critic (SAC)

Projects with DQN and Double DQN
- Cartpole, DQN
- Cartpole, Double DQN
- LunarLander-v2, DQN
- Navigation, DQN
- Snake, DQN, Pygame

Projects with PPO
- Pong, 8 parallel agents
- CarRacing, single agent, learning from pixels
- Crawler, 12 parallel agents
- BipedalWalker, 16 parallel agents

Projects with TD3
- BipedalWalker
- HalfCheetahBulletEnv
- HopperBulletEnv
- Walker2DBulletEnv

Projects with Soft Actor-Critic (SAC)
- AntBulletEnv
- BipedalWalker
- HopperBulletEnv
- MinitaurBulletEnv
- MinitaurBulletDuckEnv
- Walker2dBulletEnv

BipedalWalker with different models
- BipedalWalker, Twin Delayed DDPG (TD3)
- BipedalWalker, PPO, Vectorized Environment
- BipedalWalker, Soft Actor-Critic (SAC)
- BipedalWalker, A2C, Vectorized Environment

CartPole with different models
- CartPole, Policy Based Methods, Hill Climbing
- CartPole, Policy Gradient Methods, REINFORCE
- CartPole with Deep Q-Learning
- CartPole with Double Deep Q-Learning

More links
On Policy-Gradient Methods, see 1, 2, 3. On REINFORCE, see 1, 2, 3. On PPO, see 1, 2, 3, 4, 5. On DDPG, see 1, 2. On Actor-Critic Methods and A3C, see 1, 2, 3, 4. On TD3, see 1, 2, 3. On SAC, see 1, 2, 3, 4, 5. On A2C, see 1, 2, 3, 4, 5. (These numbers were hyperlinks in the original post.)

Articles on TowardsDataScience
- How does the Bellman equation work in Deep Reinforcement Learning?
- A pair of interrelated neural networks in Deep Q-Network
- Three aspects of Deep Reinforcement Learning: noise, overestimation and exploration

Related videos made for the projects above: Four BipedalWalker Gaits; BipedalWalker by Training Stages; CarRacing by Training Stages; Lucky Hopper; Martian Ant; Lunar Armada; Wooden Snake; Walking through the chess fields; Artificial snake on the way; Learned Long Snake; Such a fast cheetah; Four stages of Minitaur training.


Deep Learning

Machine Learning

AI小助手 · 2020-11-20 15:37
Article: Tutorial: Easy Self-Supervised Learning with BYOL
FlyAI translated article. English original: Easy Self-Supervised Learning with BYOL. Tags: image classification

Note: All code from this article is available in this Google Colab notebook. You can use Colab's free GPU to run/modify the experiments yourself.

Self-Supervised Learning
Too often in deep learning, there just isn't enough labelled data. Manually labeling data is too time intensive, and outsourcing the labor can be prohibitively expensive for small companies or individuals. Self-supervised learning is a nascent sub-field of deep learning, which aims to alleviate your data problems by learning from unlabeled samples. The goal is simple: train a model so that similar samples have similar representations. Accomplishing that is (usually) not so simple, but years of research from leaders like Google have greatly advanced this field.

Before BYOL, most attempts at self-supervised learning could be categorized as either contrastive or generative learning methods. Generative learning uses GANs to model the complete data distribution, which can be extremely computationally expensive. Contrastive methods are much less expensive. As described by the authors of BYOL:

"Contrastive approaches avoid a costly generation step in pixel space by bringing representation of different views of the same image closer ('positive pairs'), and spreading representations of views from different images ('negative pairs') apart."

For this to work well, though, we must compare each sample to many other negative samples. This is problematic, because it introduces instabilities into our training, and reinforces systematic biases from the dataset. The BYOL authors describe this very clearly:

"Contrastive methods are sensitive to the choice of image augmentations. For instance, SimCLR does not work well when removing color distortion from its image augmentations. As an explanation, SimCLR shows that crops of the same image mostly share their color histograms. At the same time, color histograms vary across images. Therefore, when a contrastive task only relies on random crops as image augmentations, it can be mostly solved by focusing on color histograms alone. As a result the representation is not incentivized to retain information beyond color histograms."

This also occurs for other types of data transformations, not just color distortions. In general, contrastive training will be sensitive to systematic biases in your data. Data bias is a wide-spread issue in machine learning (see: facial recognition for women and minorities), and it's a very serious problem for contrastive methods. Luckily, BYOL does not depend on negative sampling, which provides an escape from this problem.

BYOL: Bootstrap Your Own Latent
The goal of BYOL is similar to contrastive learning, but with one big difference.
BYOL does not worry about whether dissimilar samples have dissimilar representations (the contrastive part of contrastive learning). We only care that similar samples have similar representations. This may seem like a subtle difference, but it has big implications for training efficiency and generalization:

1. Training is more efficient, because BYOL does not require negative sampling. We only sample each training example once per epoch. The negative counterparts can be ignored altogether.
2. Our model is less sensitive to systematic biases in the training dataset. Usually, this means that it generalizes better to unseen examples.

BYOL minimizes the distance between representations of each sample and a transformation of that sample. Examples of transformations include: translation, rotation, blurring, color inversion, color jitter, gaussian noise, etc. (I'm using images as a concrete example here, but BYOL works with other data types, too.) We usually train using several different types of transformations, which can be applied together or independently. In general, if you want your model to be invariant under a particular transformation, then it should be included in your training.

Coding BYOL from Scratch
Let's start by coding the transformations. The BYOL authors use a particular set of transformations, which are similar to those used in SimCLR:

import random
from typing import Callable, Tuple

from kornia import augmentation as aug
from kornia import filters
from kornia.geometry import transform as tf
import torch
from torch import nn, Tensor


class RandomApply(nn.Module):
    def __init__(self, fn: Callable, p: float):
        super().__init__()
        self.fn = fn
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        return x if random.random() > self.p else self.fn(x)


def default_augmentation(image_size: Tuple[int, int] = (224, 224)) -> nn.Module:
    return nn.Sequential(
        tf.Resize(size=image_size),
        RandomApply(aug.ColorJitter(0.8, 0.8, 0.8, 0.2), p=0.8),
        aug.RandomGrayscale(p=0.2),
        aug.RandomHorizontalFlip(),
        RandomApply(filters.GaussianBlur2d((3, 3), (1.5, 1.5)), p=0.1),
        aug.RandomResizedCrop(size=image_size),
        aug.Normalize(
            mean=torch.tensor([0.485, 0.456, 0.406]),
            std=torch.tensor([0.229, 0.224, 0.225]),
        ),
    )

I chose to use Kornia for implementing the transformations — a great Python library with fully differentiable computer vision operations. You could use any other data augmentation/transformation library, or simply write your own. We don't actually need differentiability for implementing BYOL.

Next, we need an Encoder module. The Encoder is responsible for extracting features from the base model, and projecting those features into a lower-dimensional, latent space. We'll implement it using a wrapper class, which allows us to easily use BYOL with any model — not just one that we hard-code into our scripts. There are two primary components.
1. Feature Extractor: collects the outputs from one of the last model layers.
2. Projector: a linear layer, which projects outputs down to lower dimensions.

The feature extraction is implemented using hooks. (If you're not familiar with them, see my previous article "How to Use PyTorch Hooks" for an overview and tutorial.) Other than that, the wrapper is pretty straightforward.

from typing import Union


def mlp(dim: int, projection_size: int = 256, hidden_size: int = 4096) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim, hidden_size),
        nn.BatchNorm1d(hidden_size),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_size, projection_size),
    )


class EncoderWrapper(nn.Module):
    def __init__(
        self,
        model: nn.Module,
        projection_size: int = 256,
        hidden_size: int = 4096,
        layer: Union[str, int] = -2,
    ):
        super().__init__()
        self.model = model
        self.projection_size = projection_size
        self.hidden_size = hidden_size
        self.layer = layer

        self._projector = None
        self._projector_dim = None
        self._encoded = torch.empty(0)
        self._register_hook()

    @property
    def projector(self):
        if self._projector is None:
            self._projector = mlp(
                self._projector_dim, self.projection_size, self.hidden_size
            )
        return self._projector

    # ---------- Methods for registering the forward hook ----------
    # For more info on PyTorch hooks, see:
    # https://towardsdatascience.com/how-to-use-pytorch-hooks-5041d777f904

    def _hook(self, _, __, output):
        output = output.flatten(start_dim=1)
        if self._projector_dim is None:
            # If we haven't already, measure the output size
            self._projector_dim = output.shape[-1]
        # Project the output to get encodings
        self._encoded = self.projector(output)

    def _register_hook(self):
        if isinstance(self.layer, str):
            layer = dict([*self.model.named_modules()])[self.layer]
        else:
            layer = list(self.model.children())[self.layer]
        layer.register_forward_hook(self._hook)

    # ------------------- End hooks methods ----------------------

    def forward(self, x: Tensor) -> Tensor:
        # Pass through the model, and collect 'encodings' from our forward hook!
        _ = self.model(x)
        return self._encoded

BYOL contains two identical Encoder networks. The first is trained as usual, and its weights are updated with each training batch. The second (referred to as the "target" network) is updated using a running average of the first Encoder's weights. During training, the target network is provided a raw training batch, and the other Encoder is given a transformed version of the same batch. Each network generates a low-dimensional, latent representation for their respective data. Then, we attempt to predict the output of the target network using a multi-layer perceptron. BYOL maximizes the similarity between this prediction and the target network's output.

Why include the multi-layer perceptron? If we want similar representations before/after transforming the data, shouldn't we just compare the latent vectors directly? Actually, no. In that case, our network would quickly learn similar representations for all images by decreasing its weights to zero.
Our model would have learned nothing at all. Instead, our MLP layer learns to recognize the data transformations and predict the target latent vector. The weights no longer collapse to zero, and we can continue learning self-consistent representations for our data!

At the end of training, we discard the target network altogether. This leaves a single Encoder, which has been trained to generate self-consistent representations for all samples in the training data. This is exactly why BYOL works for self-supervised learning! Because the learned representations are self-consistent, they are (mostly) invariant under different transformations of the data. Similar examples have similar representations in the trained model!

Now, we need to write the BYOL training code. I chose to use PyTorch Lightning for this. It's a fantastic library for deep learning projects/research written in PyTorch, which includes conveniences like multi-GPU training, experiment logging, model checkpointing, and mixed-precision training. (You can now even run PyTorch models on cloud TPUs with Lightning!)

from copy import deepcopy
from typing import Dict, List

import pytorch_lightning as pl
from torch import optim
import torch.nn.functional as f


def normalized_mse(x: Tensor, y: Tensor) -> Tensor:
    x = f.normalize(x, dim=-1)
    y = f.normalize(y, dim=-1)
    return torch.mean(2 - 2 * (x * y).sum(dim=-1))


class BYOL(pl.LightningModule):
    def __init__(
        self,
        model: nn.Module,
        image_size: Tuple[int, int] = (96, 96),
        hidden_layer: Union[str, int] = -2,
        projection_size: int = 256,
        hidden_size: int = 4096,
        augment_fn: Callable = None,
        beta: float = 0.99,
        **hparams,
    ):
        super().__init__()
        self.augment = default_augmentation(image_size) if augment_fn is None else augment_fn
        self.beta = beta
        self.encoder = EncoderWrapper(
            model, projection_size, hidden_size, layer=hidden_layer
        )
        self.predictor = nn.Linear(projection_size, projection_size)
        self.hparams = hparams
        self._target = None

        # Perform a single forward pass, which initializes the 'projector' in our
        # 'EncoderWrapper' layer.
        self.encoder(torch.zeros(2, 3, *image_size))

    def forward(self, x: Tensor) -> Tensor:
        return self.predictor(self.encoder(x))

    @property
    def target(self):
        if self._target is None:
            self._target = deepcopy(self.encoder)
        return self._target

    def update_target(self):
        for p, pt in zip(self.encoder.parameters(), self.target.parameters()):
            pt.data = self.beta * pt.data + (1 - self.beta) * p.data

    # --- Methods required for PyTorch Lightning only! ---
    def configure_optimizers(self):
        optimizer = getattr(optim, self.hparams.get("optimizer", "Adam"))
        lr = self.hparams.get("lr", 1e-4)
        weight_decay = self.hparams.get("weight_decay", 1e-6)
        return optimizer(self.parameters(), lr=lr, weight_decay=weight_decay)

    def training_step(self, batch, *_) -> Dict[str, Union[Tensor, Dict]]:
        x = batch[0]
        with torch.no_grad():
            x1, x2 = self.augment(x), self.augment(x)

        pred1, pred2 = self.forward(x1), self.forward(x2)
        with torch.no_grad():
            targ1, targ2 = self.target(x1), self.target(x2)
        loss = (normalized_mse(pred1, targ2) + normalized_mse(pred2, targ1)) / 2

        self.log("train_loss", loss.item())
        # Update the EMA ("target") encoder after each training step.
        self.update_target()
        return {"loss": loss}

    @torch.no_grad()
    def validation_step(self, batch, *_) -> Dict[str, Union[Tensor, Dict]]:
        x = batch[0]
        x1, x2 = self.augment(x), self.augment(x)
        pred1, pred2 = self.forward(x1), self.forward(x2)
        targ1, targ2 = self.target(x1), self.target(x2)
        loss = (normalized_mse(pred1, targ2) + normalized_mse(pred2, targ1)) / 2
        return {"loss": loss}

    @torch.no_grad()
    def validation_epoch_end(self, outputs: List[Dict]) -> Dict:
        val_loss = sum(x["loss"] for x in outputs) / len(outputs)
        self.log("val_loss", val_loss.item())

Most of this is boilerplate code for interfacing with PyTorch Lightning. The important part happens in training_step, where all of the data transformations, feature projections, and similarity losses are computed.

Practical Example
Time to see BYOL in action. As a practical example, we'll be using the STL10 dataset. It's perfect for unsupervised and self-supervised learning experiments, because it contains a large number of unlabeled images, as well as labeled training and test sets. As described on the STL10 site:

"The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training. The primary challenge is to make use of the unlabeled data (which comes from a similar but different distribution from the labeled data) to build a useful prior."

Torchvision has convenient methods for loading STL10, so we don't need to worry about downloading or pre-processing the data in any way.

from torchvision.datasets import STL10
from torchvision.transforms import ToTensor

TRAIN_DATASET = STL10(root="data", split="train", download=True, transform=ToTensor())
TRAIN_UNLABELED_DATASET = STL10(
    root="data", split="train+unlabeled", download=True, transform=ToTensor()
)
TEST_DATASET = STL10(root="data", split="test", download=True, transform=ToTensor())

As a baseline, we first perform supervised training, and then measure the accuracy of our trained model.
We can write another (much simpler) Lightning module to accomplish this:

class SupervisedLightningModule(pl.LightningModule):
    def __init__(self, model: nn.Module, **hparams):
        super().__init__()
        self.model = model

    def forward(self, x: Tensor) -> Tensor:
        return self.model(x)

    def configure_optimizers(self):
        optimizer = getattr(optim, self.hparams.get("optimizer", "Adam"))
        lr = self.hparams.get("lr", 1e-4)
        weight_decay = self.hparams.get("weight_decay", 1e-6)
        return optimizer(self.parameters(), lr=lr, weight_decay=weight_decay)

    def training_step(self, batch, *_) -> Dict[str, Union[Tensor, Dict]]:
        x, y = batch
        loss = f.cross_entropy(self.forward(x), y)
        self.log("train_loss", loss.item())
        return {"loss": loss}

    @torch.no_grad()
    def validation_step(self, batch, *_) -> Dict[str, Union[Tensor, Dict]]:
        x, y = batch
        loss = f.cross_entropy(self.forward(x), y)
        return {"loss": loss}

    @torch.no_grad()
    def validation_epoch_end(self, outputs: List[Dict]) -> Dict:
        val_loss = sum(x["loss"] for x in outputs) / len(outputs)
        self.log("val_loss", val_loss.item())

Now, training with PyTorch Lightning is pretty straightforward. Just create DataLoader objects for the training and test sets, and specify the model we want to train. I chose to train for 25 epochs with a learning rate of 1e-4.

from torch.utils.data import DataLoader
from torchvision.models import resnet18

model = resnet18(pretrained=True)
supervised = SupervisedLightningModule(model)
trainer = pl.Trainer(max_epochs=25, gpus=-1, weights_summary=None)

train_loader = DataLoader(
    TRAIN_DATASET,
    batch_size=128,
    shuffle=True,
    drop_last=True,
)
val_loader = DataLoader(
    TEST_DATASET,
    batch_size=128,
)
trainer.fit(supervised, train_loader, val_loader)

def accuracy(pred: Tensor, labels: Tensor) -> float:
    return (pred.argmax(dim=-1) == labels).float().mean().item()

acc = sum([accuracy(model(x), y) for x, y in val_loader]) / len(val_loader)
print(f"Accuracy: {acc:.3f}")
# Accuracy: 0.852

We achieve around 85% accuracy — not bad, given a very small model like ResNet18. But naturally, we're not happy with 85% accuracy, because we can do better! For the next experiment, we'll pre-train the ResNet18 model using BYOL. I chose to train for 50 epochs, using a learning rate of 1e-4 again. (This is by far the most computationally intensive step. It takes roughly 45 minutes in a standard Colab notebook with a K80 GPU.)

model = resnet18(pretrained=True)
byol = BYOL(model, image_size=(96, 96))
trainer = pl.Trainer(
    max_epochs=50,
    gpus=-1,
    # Batch size of 2048 matches the BYOL paper
    accumulate_grad_batches=2048 // 128,
    weights_summary=None,
)

train_loader = DataLoader(
    TRAIN_UNLABELED_DATASET,
    batch_size=128,
    shuffle=True,
    drop_last=True,
)
trainer.fit(byol, train_loader, val_loader)

Then, we extract the newly trained ResNet18 model, and run supervised training again. (To ensure that forward hooks from BYOL are removed, we instantiate a new model and copy the trained state dictionary over to it.)
# This ensures that we remove all hooks from the previous model,
# which are automatically implemented by BYOL.
state_dict = model.state_dict()
model = resnet18()
model.load_state_dict(state_dict)

supervised = SupervisedLightningModule(model)
trainer = pl.Trainer(
    max_epochs=25,
    gpus=-1,
    weights_summary=None,
)

train_loader = DataLoader(
    TRAIN_DATASET,
    batch_size=128,
    shuffle=True,
    drop_last=True,
)
trainer.fit(supervised, train_loader, val_loader)

acc = sum([accuracy(model(x), y) for x, y in val_loader]) / len(val_loader)
print(f"Accuracy: {acc:.3f}")
# Accuracy: 0.877

Just like that, we've boosted the model accuracy by roughly 2.5%, up to 87.7% overall! It required a decent amount of code (~300 lines) and some helpful libraries, but this was simpler than many other self-supervised methods. (For comparison, take a look at the official SimCLR or SwAV repositories.) And the entire experiment takes less than an hour, even when using modest hardware provided for free through Colab.

通过这种方式,模型准确率提高了约2.5%,达到了87.7%!该方法需要一定量的代码(大约300行)以及一些库的支撑,但相比许多其他自监督方法仍然十分简洁(作为对比,可以看看官方的SimCLR或SwAV仓库有多复杂)。而且整个实验耗时很短,即使用Colab免费提供的普通硬件,全部跑完也不到一个小时。

Conclusions
There are some very interesting takeaways here. First (and most obviously), BYOL is a pretty cool self-supervised method, which can maximize your model performance by leveraging unlabeled data. What's even more interesting is that BYOL outperformed pre-trained ResNet18, since all ResNet models are pre-trained using ImageNet. STL10 is a small subset of ImageNet with all images downsized from 224x224 to a resolution of 96x96. Because of the change in resolution, we need self-supervised learning to recover some of the model performance. The small training set provided in STL10 is just not enough to accomplish that alone.

ML practitioners often rely too heavily on pre-trained weights in models like ResNet. They're certainly useful, but they aren't necessarily well-suited for other data domains — even very similar data to ImageNet, such as STL10. For that reason, I hope the next few years lead to broad adoption of self-supervised methods in deep learning workflows.

结论
本文要点总结如下。首先也是最重要的,BYOL是一种巧妙的自监督学习方法,可以利用未标记数据来最大限度地提高模型性能。更有趣的是,BYOL的表现超过了预训练的ResNet18,而所有ResNet模型都是用ImageNet预训练的。STL10是ImageNet的一个小子集,所有图像都从224x224分辨率缩小到了96x96。正因为分辨率发生了改变,我们才需要借助自监督学习来恢复一部分模型性能,而仅靠STL10提供的小规模带标记训练集是不够的。

ML从业人员往往过于依赖ResNet这类模型的预训练权重。预训练权重固然有用,但并不一定适合其他数据领域,哪怕是STL10这种与ImageNet高度相似的数据也是如此。因此,我希望未来几年自监督方法能在深度学习工作流中得到广泛应用。

参考资料
https://arxiv.org/pdf/2006.07733.pdf
https://arxiv.org/pdf/2006.10029v2.pdf
https://github.com/fkodom/byol
https://github.com/lucidrains/byol-pytorch
https://github.com/google-research/simclr
http://image-net.org/
https://cs.stanford.edu/~acoates/stl10/
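补充:正文训练代码中调用的 normalized_mse 在本节摘录中没有给出定义。按照BYOL论文中损失的形式(先对向量做L2归一化再计算均方误差,等价于 2 − 2×余弦相似度),一个可能的实现示意如下;函数名沿用正文,具体写法属于笔者的推测,并非原文代码:

import torch.nn.functional as f
from torch import Tensor

def normalized_mse(pred: Tensor, target: Tensor) -> Tensor:
    # 先对两组向量做 L2 归一化;
    # 对单位向量而言,均方误差等价于 2 - 2 * 余弦相似度
    pred = f.normalize(pred, dim=-1)
    target = f.normalize(target, dim=-1)
    return 2 - 2 * (pred * target).sum(dim=-1).mean()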

1

图像分类

AI小助手·2020-11-18 17:13 1 阅读 48
文章一种用于曝光恢复的无参考的深度曲线估计算法
章理登 2020-11-16

简介
文章提出一种无参考的深度曲线估计增强方法(Zero-Reference Deep Curve Estimation,Zero-DCE)。它不需要成对数据,仅通过4个无监督损失进行约束。文章方法训练了一个DCE-Net,由它给出几张不同的参数图(A图),参数图上的每个参数值对应一个不同的映射函数。

Model
上图为网络基本结构。输入图像送入DCE-Net后获得多张A图,这些A图用于指导后续的图像增强过程。每张输入图像都会经过n次映射得到最终结果,每次映射依据一张不同的A图进行。

映射过程如下所示(即二次曲线 LE(I(x);α) = I(x) + αI(x)(1−I(x))):I(x)为输入的像素值,LE(I(x);α)为输出结果,其中α∈[-1,1],表示A图中某一个位置上的值。

损失函数
文章使用了以下四种损失函数:

第一个称为空间一致性损失,其目标在于约束相邻的图像块在增强前后的差距保持一致。

第二个称为曝光控制损失,它是图像调整曝光的主要动力,迫使图像在局部块中达到亮度E,文章在实验中将E设置为0.6。

第三个称为颜色一致性损失,它遵循图像灰度(灰色世界)一致性假设,希望图像三个通道上的均值彼此接近。

第四个称为估计平滑损失,它统计每张A图在水平方向和垂直方向上的梯度,要求A图尽量平滑。

其中权重W_col为0.5,W_tvA为20。上图展示了各个损失函数的作用。

结果展示

总结和分析
文章方法难以抑制噪声,或者说文章根本没有考虑噪声抑制;但对低照度图像增强而言,这一点应当被考虑。黑暗物体的边缘存在一定的光晕,我认为这是曝光控制损失造成的:它要求局部像素均值达到一定阈值;不过由于估计平滑损失的权重为20,光晕现象得到了一定的控制。文章的方法非常讨巧:一方面使用映射函数进行多轮微调,另一方面用一个轻量级网络估计多张A图作为参数,整体上不依赖成对数据集,避免了成对数据的约束,但也带来了噪声和色偏等问题。如果要进一步改进,可以考虑:自适应的循环次数;噪声抑制;改善对极暗图像的处理。
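下面用一小段示意代码说明上述迭代映射过程(笔者按上式写的草图,并非原文代码;其中迭代次数与A图的构造均为示意性假设):

import torch

def enhance(image, curve_maps):
    # image: (B, 3, H, W),像素值已归一化到 [0, 1]
    # curve_maps: n 张 A 图组成的列表,每张形状与 image 相同,取值在 [-1, 1]
    x = image
    for A in curve_maps:
        # 每次迭代套用一次二次曲线 LE(x; A) = x + A * x * (1 - x)
        x = x + A * x * (1 - x)
    return x

# 用法示意:8 次迭代,每次使用一张独立的 A 图
img = torch.rand(1, 3, 64, 64)
maps = [torch.rand_like(img) * 2 - 1 for _ in range(8)]
out = enhance(img, maps)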

1

深度学习

flyai会员1587360332·2020-11-17 16:43 1 阅读 56
文章这四种超参调试技术, 你都了解了吗?
FlyAI文献翻译 英文原文:4 Hyper-Parameter Tuning Techniques 标签:深度学习

图片来自SigOpt

Introduction
Wikipedia states that "Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm" One of the toughest parts in ML workflow is finding the best hyperparameters for the model. Performance of ML models is directly related with Hyper-parameter's. The more you tune the hyperparameters, the better model you get. Tuning Hyper-parameters could be really tedious, difficult and is more of an art than science.

介绍
Wikipedia指出:"超参数优化或调整是为学习算法选择一组最佳超参数的问题。"ML工作流程中最困难的部分之一就是为模型找到最佳超参数。ML模型的性能与超参数直接相关:超参数调得越好,得到的模型就越好。调整超参数可能非常繁琐、困难,与其说是科学,不如说是一门艺术。

Hyper-parameters
Hyper-parameters are the parameters used to control the behavior of the algorithm while building the model. These parameters cannot be learned from the regular training process. They need to be assigned before training of the model.

超参数
超参数是在构建模型时用于控制算法行为的参数。这些参数无法从常规训练过程中学到,需要在模型训练之前指定。

A sample list of Hyper-parameters by Dr.Mukesh Rao
Dr.Mukesh Rao给出的超参数示例列表

Table of Content
1. Traditional or Manual Tuning
2. Grid Search
3. Random Search
4. Bayesian Search

目录
1. 传统或手动调优
2. 网格搜索
3. 随机搜索
4. 贝叶斯搜索

1. Traditional or Manual Tuning
In Traditional Tuning, we manually check with random sets of hyper-parameters by training the algorithm and select the best set of parameters that fits our objective. Let's see with code

1. 传统或手动调优
在传统调优中,我们通过训练算法手动尝试若干组随机的超参数,并从中选出最符合目标的一组参数。我们看一下代码(原文此处的代码未能保留,文末附有一个综合的代码示意)。

Drawbacks:
. No guarantee to get the best parameter combination.
. It's a trial and error approach. So, it consumes more time.

缺点:
. 无法保证获得最佳的参数组合。
. 这是一种反复试验的方法,因此更耗时。

2. Grid Search
Grid Search is a basic hyper-parameter tuning technique. It is similar to manual tuning where it builds a model for each permutation of all the given hyperparameter values specified in the grid, evaluates and selects the best model. Consider the above example where it has two hyperparameters k_value =[2,3,4,5,6,7,8,9,10] & algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute'], in this case it builds a total of 9*4 = 36 different models.

2. 网格搜索
网格搜索是一种基本的超参数调整技术。它与手动调参类似,区别在于它会为网格中给定超参数取值的每一种组合都构建一个模型,逐一评估并选出最佳模型。以上面的示例来说,两个超参数分别为 k_value=[2,3,4,5,6,7,8,9,10] 与 algorithm=['auto','ball_tree','kd_tree','brute'],此时总共会构建 9*4=36 个不同的模型。

Image from SigOpt
Let's understand how sklearn's GridSearchCV works,

图片来自SigOpt
让我们了解sklearn的GridSearchCV的工作原理。

Drawback: As it tries every combination of the hyper-parameters and choose the best combination based on the cross-validation score, this makes GridsearchCV extremely slow.

缺点:由于它会尝试超参数的每一种组合,并根据交叉验证得分选出最佳组合,GridSearchCV会极其缓慢。

3. Random Search
The motivation to use random search in place of grid search is that for many cases, all the hyperparameters may not be equally important. Random Search selects random combination of parameters from the hyper-parameter space, parameters will be selected with fixed numbers of iterations given by n_iter. Random search has empirically been demonstrated to give better results than Grid Search.

3. 随机搜索
用随机搜索代替网格搜索的动机在于:很多情况下,并非所有超参数都同等重要。随机搜索从超参数空间中随机选取参数组合,采样次数由 n_iter 给定。经验表明,随机搜索往往能比网格搜索取得更好的结果。

Image from SigOpt
Let's understand how sklearn's RandomizedSearchCV works,

图片来自SigOpt
让我们了解sklearn的RandomizedSearchCV的工作原理。

Drawback: The problem with random search is it doesn't guarantee to give the best parameter combination.
缺点:随机搜索的问题在于不能保证给出最佳的参数组合。

4. Bayesian Search
Bayesian optimization falls in a class of optimization algorithms called sequential model-based optimization (SMBO) algorithms. These algorithms use previous observations of the loss f, to determine the next (optimal) point to sample f for. The algorithm can roughly be outlined as follows.
. Using previously evaluated points X1:n, compute a posterior expectation of what the loss f looks like.
. Sample the loss f at a new point Xnew, that maximizes some utility of the expectation of f. The utility specifies which regions of the domain of f are optimal to sample from.
. These steps are repeated until some convergence criterion is met.

4. 贝叶斯搜索
贝叶斯优化属于一类称为基于序列模型的优化(Sequential Model-Based Optimization,SMBO)的算法。这类算法利用对损失 f 的历史观测值,来确定下一个(最优的)采样点。算法大致可概述如下:
. 利用已评估过的点 X1:n,计算损失 f 形态的后验期望。
. 在使 f 期望的某个效用最大化的新点 Xnew 处对损失 f 采样。效用函数指定从 f 定义域的哪些区域采样最优。
. 重复上述步骤,直到满足某个收敛标准。

Image from SigOpt
Let's understand this with scikit-optimize's BayesSearchCV Installation : pip install scikit-optimize

图片来自SigOpt
让我们通过scikit-optimize的BayesSearchCV了解这一点。安装:pip install scikit-optimize

Another similar library to implement Bayesian Search is bayesian-optimization Installation : pip install bayesian-optimization

另一个可实现贝叶斯搜索的类似库是bayesian-optimization。安装:pip install bayesian-optimization

Drawback: It takes a dozen or so samples to get a good surrogate surface in 2 or 3 dimensions of search space; increasing dimensionality of the search space requires yet more samples.

缺点:在2到3维的搜索空间中,需要十几个样本才能得到较好的代理曲面;搜索空间维数增加时,需要的样本还会更多。

Conclusion
There is always a trade-off between the guarantee to identify the best combination of parameters and the computation time. If your hyper-parameter space (number of hyper-parameters) is very large, then use Random Search to find the potential combination of hyper-parameters and then use Grid Search in that locality (potential combination of hyper-parameters) to select the optimal features.

结论
在"保证找到最佳参数组合"和"计算时间"之间始终需要取舍。如果超参数空间(超参数数量)非常大,可以先用随机搜索找到有潜力的超参数组合,再在该局部区域内用网格搜索选出最优配置。三种搜索方式的综合代码示意见下文。
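下面是笔者补充的一个综合示意(草图,非原文代码):在iris数据集上用KNN依次演示网格搜索、随机搜索与贝叶斯搜索,超参数空间沿用正文中 k_value 与 algorithm 的设定;BayesSearchCV 来自 scikit-optimize(需先 pip install scikit-optimize)。

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from skopt import BayesSearchCV

X, y = load_iris(return_X_y=True)
params = {
    "n_neighbors": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

# 网格搜索:遍历全部 9*4=36 种组合
grid = GridSearchCV(KNeighborsClassifier(), params, cv=5)
grid.fit(X, y)
print("Grid   :", grid.best_params_, grid.best_score_)

# 随机搜索:只随机采样 n_iter 组组合
rand = RandomizedSearchCV(KNeighborsClassifier(), params, n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print("Random :", rand.best_params_, rand.best_score_)

# 贝叶斯搜索:根据历史观测决定下一组要评估的超参数
bayes = BayesSearchCV(KNeighborsClassifier(), params, n_iter=10, cv=5, random_state=0)
bayes.fit(X, y)
print("Bayes  :", bayes.best_params_, bayes.best_score_)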

1

深度学习

AI小助手·2020-11-16 15:30 0 阅读 74
文章潜图学习:流形学习新视角
FlyAI文献翻译 英文原文:Latent graph neural networks: Manifold learning 2.0? 标签:图神经网络

TL;DR: Graph neural networks exploit relational inductive biases for data that come in the form of a graph. However, in many cases we do not have the graph readily available. Can graph deep learning still be applied in this case? In this post, I draw parallels between recent works on latent graph learning and older techniques of manifold learning.

The past few years have witnessed a surge of interest in developing ML methods for graph-structured data. Such data naturally arises in many applications such as social sciences (e.g. the Follow graph of users on Twitter or Facebook), chemistry (where molecules can be modelled as graphs of atoms connected by bonds), or biology (where interactions between different biomolecules are often modelled as a graph referred to as the interactome). Graph neural networks (GNNs), which I have covered extensively in my previous posts, are a particularly popular method of learning on such graphs by means of local operations with shared parameters exchanging information between adjacent nodes.

TL;DR:图神经网络利用关系归纳偏差来处理图形式的数据。但在很多情况下,我们并没有现成的图可用。这种情况下还能应用图深度学习吗?在这篇文章中,我将把潜图学习的最新工作与较早的流形学习技术进行类比。

过去几年里,面向图结构数据开发机器学习方法的兴趣激增。这类数据自然地出现在很多应用领域中,例如社会科学(像Twitter或Facebook上的用户关注图)、化学(分子可以被建模为由化学键连接的原子图)或生物学(不同生物分子之间的相互作用常被建模为称为"交互组"的图)。图神经网络(GNN)是我在以前的文章中广泛讨论过的一种特别流行的方法,它通过带共享参数的局部操作在相邻节点之间交换信息,从而在此类图上进行学习。

In some settings, however, we do not have the luxury of being given a graph to work with as input. This is a typical situation for many biological problems, where graphs such as protein-to-protein interaction are only partially known in the best case, as the experiments by which interactions are discovered are expensive and noisy. We are therefore interested in inferring the graph from the data and applying a GNN on it [1] — I call this setting "latent graph learning" [2]. The latent graph might be application-specific and optimised for the downstream task. Furthermore, sometimes such a graph might be even more important than the task itself, as it may convey important insights about the data and offer a way to interpret the results.

A way of thinking of latent graph learning is that of a graph with an empty edge set. In this setting, the input is provided as a point cloud in some high-dimensional feature space. Unlike methods for deep learning on sets such as PointNet [3], which apply some shared learnable point-wise function to each point, we also seek to pass information across points. This is done by message passing on a graph constructed from the point features themselves.

但是,在某些情况下,我们并没有现成的图可作为输入,这是许多生物学问题的典型情况:即便在最好的情况下,蛋白质-蛋白质相互作用这类图也只是部分已知,因为发现相互作用的实验既昂贵又嘈杂。因此,我们希望从数据中推断出图,再在其上应用GNN [1]——我称这种设定为"潜图学习" [2]。潜图可能是特定于应用的,并针对下游任务进行了优化。此外,有时这样的图可能比任务本身更重要,因为它能传达关于数据的重要见解,并提供一种解释结果的方式。

可以把潜图学习理解为在一张边集为空的图上学习。在这种设定下,输入是某个高维特征空间中的点云。与PointNet [3]这类在集合上做深度学习的方法(对每个点应用共享的可学习逐点函数)不同,我们还希望在点与点之间传递信息,这是通过在由点特征本身构造出的图上做消息传递来实现的。

The first architecture of this kind, Dynamic Graph CNN (DGCNN) [4], was developed by Yue Wang from MIT, with whom I had the pleasure to collaborate during my sabbatical at that institution. Motivated by problems in computer graphics dealing with the analysis of 3D point clouds, the idea was to use the graph as a coarse representation of the local smooth manifold structure underlying a point cloud.
A key observation of Yue was that the graph does not need to stay the same throughout the neural network, and in fact it can and should be updated dynamically — hence the name of the method. The following figure from our paper illustrates why this might be useful in computer graphics problems:

第一种此类架构,即动态图CNN(DGCNN)[4],由麻省理工学院的Yue Wang开发,我在该校学术休假期间有幸与他合作。受计算机图形学中3D点云分析问题的启发,其想法是把图用作点云底层局部光滑流形结构的粗略表示。

Yue的一个关键观察是,这张图不必在整个神经网络中保持不变,实际上它可以也应该被动态更新——方法的名称正由此而来。我们论文中的下图说明了这为何对计算机图形学问题有用:

Dynamic Graph CNNs construct on the fly a k-nearest neighbour graph that is used for feature diffusion. The graph is task-dependent and is updated after each layer. This figure (taken from [4]) depicts the distance from the red point (yellow colours represents closer points), showing that on a segmentation task the graphs in deeper layers capture the semantic rather than geometric relations such as pairs of wings, engines, etc.

动态图CNN会动态构建一个用于特征扩散的k近邻图。该图依赖于任务,并在每一层之后更新。此图(取自[4])描绘了各点到红点的距离(黄色代表较近的点),表明在分割任务中,较深层的图捕获的是语义关系而非几何关系,例如成对的机翼、引擎等。

One of the limitations of DGCNNs was that the same space is used to construct the graph and the features on that graph. In a recent work with Anees Kazi from TUM and my postdoc Luca Cosmo, we proposed a new architecture called Differentiable Graph Module (DGM) [5] extending DGCNN by decoupling the graph and feature construction, as shown in the following figure:

DGCNN的局限之一是构造图与构造图上特征使用的是同一个空间。在与TUM的Anees Kazi以及我的博士后Luca Cosmo最近的合作工作中,我们提出了一种称为可微图模块(DGM)的新架构[5],它通过解耦图的构造与特征的构造来扩展DGCNN,如下图所示:

Differentiable Graph Module (DGM) provides a mechanism to construct the graph and the features for the diffusion on the graph from the input data. Figure from [5].

可微图模块(DGM)提供了一种机制,可根据输入数据分别构造图以及用于在图上扩散的特征。图片来自[5]。

DGM showed impressive results when applied to problems from the medical domain, such as disease prediction from brain imaging data. In these tasks, we are provided with the electronic health records of multiple patients, including demographic features (such as age, sex, etc) and brain imaging features, and attempt to predict whether a patient suffers from a neurological disease. Previous works showed the application of GNNs to such tasks using diffusion on a "patient graph" constructed by hand from the demographic features [6]. DGM offers the advantage of learning the graph, which possibly conveys insight into how certain features depend on each other for the specific diagnosis task. As a bonus, DGM also beat DGCNN in its game of point cloud classification, albeit only slightly.

DGCNN and DGM bear conceptual similarity to a family of algorithms called manifold learning or non-linear dimensionality reduction, which were extremely popular in machine learning when I was a student in the 2000s, and are still used for data visualisation. The assumption underlying manifold learning methods is that of the data having an intrinsic low-dimensional structure.
Though the data can be represented in a space of hundreds or even thousands of dimensions, it only has a few degrees of freedom, as shown in the following example:

将DGM应用于医学领域的问题(例如基于脑成像数据的疾病预测)时,其结果令人印象深刻。在这些任务中,我们拥有多名患者的电子健康记录,包括人口统计特征(年龄、性别等)和脑成像特征,并尝试预测患者是否患有神经系统疾病。先前的工作展示了GNN在此类任务上的应用,方式是在根据人口统计特征手工构建的"患者图"上做扩散[6]。DGM的优势在于能学习这张图,从而可能揭示在特定诊断任务中某些特征如何相互依赖。作为额外收获,DGM在点云分类任务上也胜过了DGCNN,尽管只是略胜一筹。

DGCNN和DGM在概念上与一类称为流形学习或非线性降维的算法相似。这些算法在我2000年代当学生时在机器学习领域非常流行,如今仍用于数据可视化。流形学习方法的基础假设是:数据具有内在的低维结构。尽管数据可以在数百甚至数千维的空间中表示,但它只有少数几个自由度,如以下示例所示:

While the hand images in this dataset are high-dimensional (64x64 pixels result in 4096 dimensions), they are intrinsically low-dimensional and can be explained by just two degrees of freedom: wrist rotation and finger extension. Manifold learning algorithms are able to capture this low-dimensional intrinsic structure of the dataset and represent it in the Euclidean space. Figure from the Isomap paper [9] supplementary materials.

尽管此数据集中的手部图像是高维的(64x64像素即4096维),它们本质上是低维的,仅用两个自由度即可解释:手腕旋转和手指伸展。流形学习算法能够捕获数据集的这种低维内在结构,并将其表示在欧几里得空间中。图来自Isomap论文[9]的补充材料。

The purpose of manifold learning is to capture these degrees of freedom (by reconstructing the underlying "manifold", hence the name [7]) and reduce the dimensionality of the data to its intrinsic dimension. The important difference from linear dimensionality reduction such as PCA is that, due to the non-Euclidean structure of the data, there might be no possibility to recover the manifold by means of a linear projection [8]:

流形学习的目的是捕获这些自由度(通过重构底层的"流形",因此得名[7]),并将数据的维数降到其内在维数。与线性降维(例如PCA)的重要区别在于:由于数据的非欧几里得结构,可能无法通过线性投影恢复流形[8]:

Linear dimensionality reduction (left) vs manifold learning. The "Swiss roll surface" (coined by Joshua Tenenbaum and shown here in its 1D incarnation) is a common example in manifold learning papers.

线性降维(左)与流形学习。"瑞士卷曲面"(由Joshua Tenenbaum提出,此处展示其一维形式)是流形学习论文中的常见示例。

Manifold learning algorithms vary in the way they approach the recovery of the "manifold", but share a common blueprint. First, they create a representation of the data, which is typically done by constructing a k-nearest neighbour graph capturing its local structure. Second, they compute a low-dimensional representation (embedding) of the data trying to preserve the structure of the original data. This is where most manifold learning methods differ. For example, Isomap [9] tries to preserve the graph geodesic distance, Locally Linear Embedding [10] finds a local representation of adjacent points, and Laplacian eigenmaps [11] use the eigenfunctions of the graph Laplacian operator as the low-dimensional embedding. This new representation "flattens" the original non-Euclidean structure into a Euclidean space that is easier to deal with. Third, once the representation is computed, a machine learning algorithm (typically clustering) is applied to it.

各种流形学习算法恢复"流形"的方式不同,但有一个共同的蓝图。首先,它们创建数据的一种表示,通常是构造一个捕获局部结构的k近邻图。其次,它们计算数据的低维表示(嵌入),尽量保留原始数据的结构——这是各种流形学习方法的主要差异所在。例如,Isomap [9]试图保留图的测地距离,局部线性嵌入(Locally Linear Embedding)[10]寻找相邻点的局部表示,而拉普拉斯特征映射(Laplacian eigenmaps)[11]使用图拉普拉斯算子的特征函数作为低维嵌入。这种新的表示把原始的非欧几里得结构"压平"到更易处理的欧几里得空间中。第三,一旦计算出该表示,就对其应用机器学习算法(通常为聚类)。

A blueprint for many manifold learning methods: first, the data is represented as a graph. Second, a low-dimensional embedding of this graph is computed. Third, an ML algorithm is applied on this low-dimensional representation.
许多流形学习方法的蓝图:首先,数据被表示为一张图;其次,计算该图的低维嵌入;第三,在该低维表示上应用ML算法。

One of the challenges is that the construction of the graph is decoupled from the ML algorithm, and sometimes delicate parameter tuning (e.g. the number of neighbours or the neighbourhood radius) is needed in order to figure out how to build the graph to make the downstream task work well. Perhaps a far more serious drawback of manifold learning algorithms is that data rarely presents itself as low-dimensional in its native form. When dealing with images, for example, various handcrafted feature extraction techniques had to be used as pre-processing steps. Graph deep learning offers a modern take on this process, by replacing this three-stage process outlined above with a single graph neural network. In dynamic graph CNNs or DGM, for instance, the construction of the graph and the learning are part of the same architecture:

挑战之一在于图的构造与ML算法是脱钩的,有时需要精细的参数调整(例如邻居数或邻域半径),才能构建出让下游任务表现良好的图。流形学习算法或许更严重的缺点是,数据很少以其原始形式呈现为低维:例如在处理图像时,必须先用各种手工特征提取技术作预处理。图深度学习为这一过程提供了现代化的方案——用单个图神经网络取代上述三阶段流程。例如在动态图CNN或DGM中,图的构造和学习是同一架构的组成部分:

Latent graph learning can be regarded as a modern setting of the manifold learning problem where the graph is learned as part of an end-to-end GNN pipeline optimised for some downstream task.

潜图学习可以看作流形学习问题的现代设定:图作为端到端GNN管道的一部分被学习出来,并针对某个下游任务进行优化。

The appeal of this approach is the possibility to combine the treatment of individual data points and the space in which they reside in the same pipeline. In the example of images, one could use traditional CNNs to extract the visual features from each image and use a GNN to model the relations between them. This approach was used in the work of my PhD student Jan Svoboda: he proposed a graph-based regularisation layer (called PeerNet) for CNNs that allows to exchange information between multiple images [12]. PeerNets bear similarity to non-local means filters [13] in the way they aggregate information from multiple locations, with the main difference that the aggregation happens across multiple images rather than a single one. We showed that such a regularisation dramatically reduces the effect of adversarial perturbations to which standard CNNs are highly susceptible [14].

这种方法的吸引力在于,可以在同一管道中同时处理单个数据点及其所在的空间。以图像为例,可以用传统CNN从每幅图像中提取视觉特征,再用GNN对它们之间的关系建模。我的博士生Jan Svoboda的工作就采用了这种方法:他为CNN提出了一个基于图的正则化层(称为PeerNet),允许在多幅图像之间交换信息[12]。PeerNet在从多个位置聚合信息这一点上与非局部均值滤波[13]相似,主要区别在于聚合发生在多幅图像之间,而不是单幅图像内部。我们表明,这种正则化显著削弱了对抗扰动的影响,而标准CNN对这类扰动极为敏感[14]。

PeerNets are a graph-based regularisation layer in standard CNNs aggregating similar pixels from multiple images in order to reduce the sensitivity to adversarial perturbations. Figure from [12].

PeerNets是标准CNN中基于图的正则化层,通过聚合来自多幅图像的相似像素来降低对对抗扰动的敏感性。图片来自[12]。

There are many other interesting applications of latent graph learning. One is few-shot learning, where graph-based techniques can help generalise from a few examples. Few-shot learning is becoming increasingly important in computer vision where the cost of data labelling is significant [5]. Another field is biology, where one often observes experimentally expression levels of biomolecules such as proteins and tries to reconstruct their interaction and signalling networks [15]. Third problem is the analysis of physical systems where a graph can describe interactions between multiple objects [16]. In particular, high-energy physicists dealing with complex particle interactions have recently been showing keen interest in graph-based approaches [17].
Last but not least are problems in NLP, where graph neural networks can be seen as generalisations of the transformer architecture. Many of the mentioned problems also raise questions on incorporating priors on the graph structure, which is still largely open: for example, one may wish to force the graph to obey certain construction rules or be compatible with some statistical model [18]. I believe that latent graph learning, while not entirely new, offers a new perspective on old problems. It is for sure an interesting setting of graph ML problems, providing a new playground for GNN researchers.

潜图学习还有许多其他有趣的应用。一是小样本学习(few-shot learning),基于图的技术可以帮助模型从少量示例中泛化;在数据标注成本高昂的计算机视觉领域,小样本学习正变得越来越重要[5]。另一个领域是生物学:人们常常通过实验观测蛋白质等生物分子的表达水平,并试图重建它们的相互作用和信号网络[15]。第三类问题是物理系统的分析,其中图可以描述多个对象之间的相互作用[16];尤其是处理复杂粒子相互作用的高能物理学家,最近对基于图的方法表现出了浓厚兴趣[17]。最后同样重要的是NLP中的问题,图神经网络可以看作Transformer架构的推广。上述许多问题还引出了如何在图结构上引入先验这一在很大程度上仍然开放的问题:例如,可能希望强制图服从某些构造规则,或与某个统计模型兼容[18]。

我相信潜图学习虽然并非全新,但它为老问题提供了新的视角。这无疑是图机器学习中一类有趣的问题设定,为GNN研究者提供了新的试验场。(文末参考文献之后附有一个"动态构图"的最小代码示意。)

[1] A slightly different but related class of methods seeks to decouple the graph provided as input from the computational graph used for message passing in graph neural networks, see e.g. J. Halcrow et al. Grale: Designing networks for graph learning (2020). arXiv:2007.12002. There are multiple reasons why one may wish to do it, one of which is breaking the bottlenecks related to the exponential growth of the neighbourhood size in some graphs, as shown by U. Alon and E. Yahav, On the bottleneck of graph neural networks and its practical implications (2020). arXiv:2006.05205.

[1] 一类略有不同但相关的方法试图把作为输入的图与GNN中用于消息传递的计算图解耦,参见 J. Halcrow et al., Grale: Designing networks for graph learning (2020), arXiv:2007.12002。这样做可能有多种理由,其一是打破某些图中邻域大小指数增长带来的瓶颈,见 U. Alon and E. Yahav, On the bottleneck of graph neural networks and its practical implications (2020), arXiv:2006.05205。

[2] Problems of reconstructing graphs underlying some data were considered in the signal processing context in the PhD thesis of Xiaowen Dong, in whose defence committee I took part in May 2014, just a few days before the birth of my son. X. Dong et al. Learning graphs from data: A signal representation perspective (2019), IEEE Signal Processing Magazine 36(3):44–63 presents a good summary of this line of work. A more recent incarnation of these approaches from the perspective of network games is the work of Y. Leng et al. Learning quadratic games on networks (2020). Proc. ICML, on whose PhD committee at MIT I was earlier this year.

[2] 重构数据底层图的问题早在信号处理领域就被研究过,例如Xiaowen Dong的博士论文;2014年5月,就在我儿子出生前几天,我参加了他的答辩委员会。X. Dong et al., Learning graphs from data: A signal representation perspective (2019), IEEE Signal Processing Magazine 36(3):44–63 对这一系列工作做了很好的总结。从网络博弈角度出发的更新版本是 Y. Leng et al., Learning quadratic games on networks (2020), Proc. ICML;今年早些时候我是该论文作者在MIT的博士委员会成员。

[3] C. Qi et al. PointNet: Deep learning on point sets for 3D classification and segmentation (2017), Proc. CVPR. PointNet is an architecture for deep learning on sets, where a shared function is applied to the representation of each point, and can be considered as a trivial case of a GNN applied to a graph with empty edge set.

[3] C. Qi et al., PointNet: Deep learning on point sets for 3D classification and segmentation (2017), Proc. CVPR. PointNet是一种在集合上做深度学习的架构,对每个点的表示应用一个共享函数,可以视为GNN作用在空边集图上的平凡特例。

[4] Y. Wang et al. Dynamic graph CNN for learning on point clouds (2019). ACM Trans. Graphics 38(5):146. This paper has become quite popular in the computer graphics community and is often used as a baseline for point cloud methods.
Ironically, it was rejected from SIGGRAPH in 2018 and was presented at the same conference only two years later after having gathered over 600 citations.

[4] Y. Wang et al., Dynamic graph CNN for learning on point clouds (2019), ACM Trans. Graphics 38(5):146. 本文在计算机图形学社区中非常流行,常被用作点云方法的基线。具有讽刺意味的是,它在2018年被SIGGRAPH拒稿,两年后才在同一会议上亮相,此时已积累了600多次引用。

[5] A. Kazi et al., Differentiable Graph Module (DGM) for graph convolutional networks (2020) arXiv:2002.04999. We show multiple applications, including medical imaging, 3D point cloud analysis, and few shot learning. See also our paper L. Cosmo et al. Latent patient network learning for automatic diagnosis (2020). Proc. MICCAI, focusing on a medical application of this method. Anees was a visiting PhD student in my group at Imperial College in 2019.

[5] A. Kazi et al., Differentiable Graph Module (DGM) for graph convolutional networks (2020), arXiv:2002.04999. 我们展示了多种应用,包括医学成像、3D点云分析和小样本学习。另见我们的论文 L. Cosmo et al., Latent patient network learning for automatic diagnosis (2020), Proc. MICCAI,它专注于该方法的医学应用。Anees 2019年曾是我在帝国理工学院(Imperial College)课题组的访问博士生。

[6] To the best of my knowledge, the first use of GNNs for brain disease prediction is by S. Parisot et al. Disease prediction using graph convolutional networks: application to autism spectrum disorder and Alzheimer's disease (2017). Proc. MICCAI. The key drawback of this approach was a handcrafted construction of the graph from demographic features.

[6] 据我所知,首次将GNN用于脑疾病预测的是 S. Parisot et al., Disease prediction using graph convolutional networks: application to autism spectrum disorder and Alzheimer's disease (2017), Proc. MICCAI。这种方法的主要缺点是图是根据人口统计特征手工构建的。

[7] Formally speaking, it is not a "manifold" in the differential geometric sense of the term, since for example the local dimension can vary at different points. However, it is a convenient metaphor.

[7] 严格来说,它并不是微分几何意义上的"流形",因为(举例来说)局部维数可能在不同点上变化。不过,这是一个方便的隐喻。

[8] The more correct term is "non-Euclidean" rather than "non-linear".

[8] 更准确的术语是"非欧几里得"而非"非线性"。

[9] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction (2000), Science 290:2319–2323. Introduced the Isomap algorithm that embeds the data manifold by trying to preserve the geodesic distances on it, approximated using a k-NN graph. Geodesic distances on the graph are the lengths of the shortest paths connecting any pair of points, computed by means of the Dijkstra algorithm. Endowed with such a distance metric, the dataset is considered as a (non-Euclidean) metric space. A configuration of points in a low-dimensional space whose pairwise Euclidean distances are equal to the graph geodesic distances is known as isometric embedding in metric geometry. Usually, isometric embeddings do not exist and one has to resort to an approximation that preserves the distances the most in some sense. One way of computing such an approximation is by means of multidimensional scaling (MDS) algorithms.

[9] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction (2000), Science 290:2319–2323. 该文引入了Isomap算法,它通过尽量保留流形上的测地距离(用k-NN图近似)来嵌入数据流形。图上的测地距离是连接任意两点的最短路径长度,可用Dijkstra算法计算。配备这样的距离度量后,数据集被视为一个(非欧几里得)度量空间。在低维空间中找到一组点,使其两两欧氏距离等于图上的测地距离,这在度量几何中称为等距嵌入。通常等距嵌入并不存在,只能求某种意义下最大程度保留距离的近似;多维缩放(MDS)算法就是计算这种近似的一种方式。

[10] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding (2000). Science 290:2323–2326.

[11] M. Belkin and P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering (2001). Proc. NIPS.

[12] J. Svoboda et al. PeerNets: Exploiting peer wisdom against adversarial attacks (2019), Proc.
ICLR uses GNN module that aggregates information from multiple images to reduce the sensitivity of CNNs to adversarial perturbations of the input.

[12] J. Svoboda et al., PeerNets: Exploiting peer wisdom against adversarial attacks (2019), Proc. ICLR. 该文使用GNN模块聚合来自多幅图像的信息,以降低CNN对输入对抗扰动的敏感性。

[13] Non-local means is a non-linear image filtering technique introduced by A. Buades et al., A non-local algorithm for image denoising (2005), Proc. CVPR. It can be seen as a precursor to modern attention mechanisms used in deep learning. Non-local means itself is a variant of edge-preserving diffusion methods such as the Beltrami flow proposed by my PhD advisor Ron Kimmel in the paper R. Kimmel et al., From high energy physics to low level vision (1997), Proc. Scale-Space Theories in Computer Vision, or the bilateral filter from C. Tomasi and R. Manduchi, Bilateral filtering for gray and color images (1998). Proc. ICCV.

[13] 非局部均值是 A. Buades et al., A non-local algorithm for image denoising (2005), Proc. CVPR 提出的一种非线性图像滤波技术,可视为深度学习中现代注意力机制的先驱。非局部均值本身是保边扩散方法的一种变体,例如我的博士导师Ron Kimmel在 R. Kimmel et al., From high energy physics to low level vision (1997), Proc. Scale-Space Theories in Computer Vision 中提出的Beltrami流,或 C. Tomasi and R. Manduchi, Bilateral filtering for gray and color images (1998), Proc. ICCV 的双边滤波。

[14] Adversarial perturbation is a carefully constructed input noise that significantly reduces the performance of CNNs e.g. in image classification. This phenomenon was described in C. Szegedy et al. Intriguing properties of neural networks (2014), Proc. ICLR, and resulted in multiple follow-up works that showed bizarre adversarial attacks as extreme as changing a single pixel (J. Su et al. One pixel attack for fooling deep neural networks (2017), arXiv:1710.08864) or data-independent "universal" perturbations (S. M. Moosavi-Dezfooli et al., Universal adversarial perturbations (2017), Proc. CVPR).

[14] 对抗扰动是精心构造的输入噪声,会显著降低CNN(例如在图像分类中)的性能。这一现象最早见于 C. Szegedy et al., Intriguing properties of neural networks (2014), Proc. ICLR,并引出了多项后续工作,展示了种种离奇的对抗攻击,极端到只改变单个像素(J. Su et al., One pixel attack for fooling deep neural networks (2017), arXiv:1710.08864),或与数据无关的"通用"扰动(S. M. Moosavi-Dezfooli et al., Universal adversarial perturbations (2017), Proc. CVPR)。

[15] Y. Yu et al., DAG-GNN: DAG structure learning with graph neural networks (2019). Proc. ICML.

[16] T. Kipf et al., Neural relational inference for interaction systems (2019). Proc. ICML. Recovers a graph "explaining" the physics of a system by using a variational autoencoder, in which the latent vectors represent the underlying interaction graph and the decoder is a graph neural network.

[16] T. Kipf et al., Neural relational inference for interaction systems (2019), Proc. ICML. 该文用变分自编码器恢复一张"解释"系统物理规律的图:潜向量表示底层交互图,解码器则是一个图神经网络。

[17] The use of GNNs in high-energy physics is a fascinating topic worth a separate post. Together with my PhD student Federico Monti we have worked with the IceCube collaboration developing probably the first GNN-based approach for particle physics. Our paper N. Choma, F.
Monti et al., Graph neural networks for IceCube signal classification (2018), Proc. ICMLA, where we used the MoNet architecture for astrophysical neutrino classification, got the best paper award. In a more recent work, S. R. Qasim et al., Learning representations of irregular particle-detector geometry with distance-weighted graph networks (2019), European Physical Journal C 79, used a variant of DGCNN similar to DGM called GravNet for particle reconstruction.

[17] 在高能物理中使用GNN是一个迷人的话题,值得单独写一篇文章。我与博士生Federico Monti一起同IceCube合作组合作,开发了可能是粒子物理领域第一个基于GNN的方法。我们的论文 N. Choma, F. Monti et al., Graph neural networks for IceCube signal classification (2018), Proc. ICMLA 将MoNet架构用于天体物理中微子分类,并获得了最佳论文奖。在更近的工作中,S. R. Qasim et al., Learning representations of irregular particle-detector geometry with distance-weighted graph networks (2019), European Physical Journal C 79 使用了与DGM类似的DGCNN变体GravNet进行粒子重建。

[18] A somewhat related class of approaches are generative graph models, see e.g. Y. Li et al, Learning deep generative models of graphs (2018). arXiv:1803.03324. One of the applications is generating molecular graphs of chemical compounds that adhere to strict construction rules.

[18] 一类多少相关的方法是图生成模型,参见 Y. Li et al., Learning deep generative models of graphs (2018), arXiv:1803.03324。其应用之一是生成遵守严格构造规则的化合物分子图。

[19] There are many more works on latent graph learning papers that have appeared in the past couple of years — if I omit some, this is because my goal is not to be exhaustive but rather to show a principle. I will refer to one additional work of L. Franceschi et al. Learning discrete structures for graph neural networks (2019). Proc. ICML, which also mentions the relation to Isomap and manifold learning techniques.

[19] 过去几年里还出现了更多关于潜图学习的论文——如果我有所遗漏,那是因为我的目标不是面面俱到,而是展示原理。这里再补充一篇 L. Franceschi et al., Learning discrete structures for graph neural networks (2019), Proc. ICML,其中也提到了与Isomap及流形学习技术的关系。
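附:为了更直观地说明正文中"由点特征动态构图"这一核心操作,下面给出一个笔者补充的最小示意(并非原文或DGCNN的官方实现):先用当前特征做k近邻构图,再做一步EdgeConv风格的消息传递;每一层都会在新的特征空间中重新构图。

import torch

def knn_graph(x, k):
    # x: (N, F) 的点特征;返回每个点的 k 个最近邻下标,形状 (N, k)
    dist = torch.cdist(x, x)                       # 两两欧氏距离 (N, N)
    dist.fill_diagonal_(float("inf"))              # 排除自身
    return dist.topk(k, largest=False).indices

def edge_conv(x, idx, mlp):
    # EdgeConv 风格的一步消息传递:对每条边的 [x_i, x_j - x_i]
    # 施加共享 MLP,再在邻域内做最大值聚合
    k = idx.shape[1]
    neighbors = x[idx]                             # (N, k, F)
    center = x.unsqueeze(1).expand(-1, k, -1)      # (N, k, F)
    edge_feat = torch.cat([center, neighbors - center], dim=-1)
    return mlp(edge_feat).max(dim=1).values        # (N, F_out)

# 用法示意:每一层都依据"当前特征"动态重建 k 近邻图
dims = [3, 64, 64]
mlps = [torch.nn.Sequential(torch.nn.Linear(2 * d_in, d_out), torch.nn.ReLU())
        for d_in, d_out in zip(dims[:-1], dims[1:])]

x = torch.randn(1024, 3)                           # 随机点云,仅作演示
for mlp in mlps:
    idx = knn_graph(x, k=16)                       # 动态构图
    x = edge_conv(x, idx, mlp)                     # 特征更新,下一层将在新特征空间构图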

2

图像分类

AI小助手·2020-11-13 14:43 1 阅读 73
文章【论文分享】Learning_Rich_Features
写在前面
第一次写这种笔记,不太清楚结构该怎么安排,先把自己觉得重要的东西摘出来记录一下吧。

摘要
To address these issues, we propose a coarse to-fine architecture named Constrained R-CNN for complete and accurate image forensics. First, the learnable manipulation feature extractor learns a unified feature representation directly from data.

大多数基于深度学习的模型在做图像操作检测时只关注操作区域的定位,而忽略了操作类型的分类。为了解决这个问题,本文提出了一种由粗到细的架构。首先,可学习的操作特征提取器直接从数据中学习统一的特征表示(文中将此方法与SRM filter进行了对比)。

Second, the attention region proposal network effectively discriminates manipulated regions for the next manipulation classification and coarse localization.

之后,注意区域建议网络(RPN-A)有效地辨别出经过篡改的区域,为后续的操作分类以及粗略定位作准备。

Then, the skip structure fuses low-level and high-level information to refine the global manipulation features. Finally, the coarse localization information guides the model to further learn the finer local features and segment out the tampered region.

然后,跳跃结构融合低级和高级信息来细化全局操作特征。最后,粗略定位信息指导模型进一步学习更精细的局部特征,并把篡改区域分割出来。

引言
Nevertheless, these methods only mitigate the issue of insufficient generalization ability, they are still fundamentally limited by handcrafted or predetermined features.

目前基于深度学习的方法相比早期的图像操纵检测方法(如频域特征 frequency domain characteristics 和 CFA color filter array),更多的是提高了模型的泛化性能。它们一般先提取特征,再将特征单独或一起送入深度学习模型,来检测像素是否经过篡改。这样的方法从根本上仍受手工或预定特征的限制。

Such as the copy-move image in Fig. 1, if only the mask is provided, we will probably mistake it as a splicing image. Therefore, if the model only focuses on one goal, the results of forensics will not be convincing.

并且,图像操作检测包含两个目标:操作技术的分类以及操作区域的定位。通常的方法只关心后者,而忽略了前者的重要性。例如,下图的copy-move图像,如果最终只输出mask,则容易被误认为是splicing图像。因此,如果模型只关注一个目标,最终取证的结果就不会有说服力。

This motivates us to design a general solution that applies to detect various content manipulation, which achieves manipulation techniques classification and tampered region segmentation simultaneously

这促使我们设计一种适用于检测各种内容操纵的通用解决方案,同时实现操纵技术分类和篡改区域分割。

For the first issue, we design a learnable manipulation feature extractor (LMFE) based on the constrained convolution layer [10] to create a unified feature representation of various content manipulation directly from data.
针对第一个问题,我们设计了基于约束卷积层的可学习操作特征提取器(LMFE),直接从数据中产生各种内容操作的统一特征表示。

针对第二个问题,即篡改区域分割,我们设计了一个两阶段的架构来模拟现实中由粗到细的取证过程。具体来说,我们基于Mask R-CNN,包括两个阶段:

. 设计注意区域建议网络(RPN-A),识别经过篡改的区域,用于后续的操作技术分类以及粗略定位。
. 跳跃结构融合低级与高级信息,增强全局特征表示;来自第一阶段的边框信息(bounding box information)将指导第二阶段关注边框内的局部特征,用于最终的篡改区域分割。

我们的主要贡献如下:
. A coarse-to-fine architecture named Constrained R-CNN:能够同时实现操纵技术分类以及篡改区域分割;
. A single-stream learnable manipulation feature extractor (LMFE):直接从数据中创建各种内容操作技术的统一特征表示;
. An attention regional proposal network (RPN-A):有效区分被篡改过的区域。

所提出的方法
现实世界中的取证往往遵循这样的过程:先粗略观察图像,做近似的篡改定位;在精细阶段,再对特定区域做更详细的分析,寻找更多线索来分割篡改区域。

如下图,Constrained R-CNN由可学习操作特征提取器(LMFE)、粗略操作检测(Stage-1)、精细操作分割(Stage-2)组成。LMFE模块捕获各种内容操作的伪造线索,并创建统一的特征表示;Stage-1包含注意区域建议网络(RPN-A)和预测模块,执行操纵技术分类以及边框回归(bounding box regression);Stage-2中跳跃结构融合多级信息,增强全局特征表示,并把Stage-1的边框作为先验知识,引导Stage-2聚焦局部特征,用于最终的操作分割。

Learnable manipulation feature extractor (LMFE)
SRM滤波器常用于捕捉操纵线索,例如边界异常、噪声模式不一致。然而,由于手工设计的限制,SRM滤波器容易受到恶意攻击。与之相比,从数据中自适应学习操纵特征的约束卷积层(constrained convolution layer)能提供更好的通用性与鲁棒性。施加的约束如下(原文公式在转载中丢失,此处按约束卷积的通常定义补全):对每个卷积核 w_k,令中心权重 w_k(0,0) = −1,其余位置权重之和为1,即 Σ_{(l,m)≠(0,0)} w_k(l,m) = 1;其中 w_k 表示第k个卷积核,(0,0)是 w_k 的中心坐标。训练时用整个模型更新 w_k,然后重新执行上述约束过程。

下图比较了SRM滤波器与约束卷积层的可视化结果:约束卷积层能保留更丰富的信息;从篡改区域的细节上看,它能突出真实区域与篡改区域之间的噪声不一致性。基于这样的结果,本文设计了基于约束卷积层的可学习操纵特征提取器(LMFE),用于自适应地学习操纵检测特征。

如下图,LMFE以被操纵的图像为输入,通过约束卷积层捕捉丰富的操纵线索,然后将约束卷积特征输入ResNet-101,以创建各种内容操纵的统一特征表示。得益于单流设计,LMFE的参数量只有《Learning rich features for image manipulation detection》中方法的一半。

Coarse Manipulation Detection (Stage-1)
Stage-1的模型主要用于操纵技术分类以及粗略的操纵区域定位,包括注意区域建议网络(RPN-A)和预测模块。

注意区域建议网络(RPN-A):尽管约束卷积层能捕捉到丰富的操纵线索,但会丢失大量内容信息,比如颜色、亮度。内容信息的缺乏会削弱全局特征的类间区分度,进而削弱操纵分割的效果。因此,本文设计了注意区域建议网络来解决这个问题,让网络学习具有很强类间区分能力的特征表示。

具体来说,RPN-A加入了卷积块注意力模块(Convolutional Block Attention Module,CBAM)。CBAM沿空间与通道两个独立的维度推断出两张注意力图,再将它们与输入特征图相乘,得到更具区分性的特征。RPN-A利用经过CBAM的特征图提出候选区域,也就是潜在的篡改区域。
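下面给出施加该约束的一个最小示意(笔者按上述约束写的草图,并非论文官方实现;未处理权重和接近0等数值细节):

import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    # 每次优化器 step() 之后调用 constrain(),重新施加
    # "中心权重为 -1、其余权重之和为 1"的约束
    def constrain(self):
        with torch.no_grad():
            w = self.weight                      # (out_c, in_c, kH, kW)
            kh, kw = w.shape[-2:]
            cy, cx = kh // 2, kw // 2
            w[..., cy, cx] = 0                   # 先将中心清零
            s = w.sum(dim=(-1, -2), keepdim=True)
            w /= s                               # 使非中心权重之和为 1
            w[..., cy, cx] = -1                  # 中心权重置为 -1

layer = ConstrainedConv2d(3, 5, kernel_size=5, padding=2, bias=False)
layer.constrain()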

0

目标检测

flyai会员1604471550·2020-11-11 14:11 0 阅读 77
文章NLP 数据扩增的可视化综述
FlyAI文献翻译 英文原文:A Visual Survey of Data Augmentation in NLP 标签:自然语言处理

Unlike Computer Vision where using image data augmentation is standard practice, augmentation of text data in NLP is pretty rare. This is because trivial operations for images like rotating an image a few degrees or converting it into grayscale doesn't change its semantics. This presence of semantically invariant transformation is what made augmentation an essential toolkit in Computer Vision research.

与将图像数据扩增作为标准做法的计算机视觉不同,NLP中文本数据的扩增非常罕见。这是因为对图像的简单操作(如旋转几度或转换为灰度图)不会改变其语义。正是这种语义不变变换的存在,使扩增成为计算机视觉研究中必不可少的工具。

I was curious if there were attempts at developing augmentation techniques for NLP and explored the existing literature. In this post, I will share my findings of the current approaches being used for augmenting text data.

我很好奇是否有人尝试为NLP开发扩增技术,于是查阅了现有文献。在这篇文章中,我将分享当前用于扩增文本数据的各种方法。

Approaches
1. Lexical Substitution
This approach tries to substitute words present in a text without changing the gist of the sentence.

Thesaurus-based substitution
In this technique, we take a random word from the sentence and replace it with its synonym using a Thesaurus. For example, we could use the WordNet lexical database for English to look up the synonyms and then perform the replacement. It is a manually curated database with relations between words.

方法
1. 词汇替换
此方法尝试在不改变句子主旨的情况下替换文本中的词。

基于同义词库的替换
在此技术中,我们从句子中随机抽取一个单词,并使用同义词库将其替换为同义词。例如,可以使用WordNet词汇数据库查找英文同义词,然后执行替换。WordNet是一个手动整理的数据库,收录了词与词之间的关系。

Zhang et al. used this technique in their 2015 paper "Character-level Convolutional Networks for Text Classification". Mueller et al. used a similar strategy to generate additional 10K training examples for their sentence similarity model. NLTK provides a programmatic access to WordNet. You can also use TextBlob API. There is also a database called PPDB containing millions of paraphrases that you can download and access programmatically.

Zhang等人在2015年的论文"Character-level Convolutional Networks for Text Classification"中使用了这一技术。Mueller等人使用类似的策略为其句子相似性模型额外生成了1万条训练样本。NLTK支持对WordNet的程序化访问,你也可以使用TextBlob的API。此外还有一个名为PPDB的数据库,包含数百万条可编程下载和访问的复述语料。

Word-Embeddings Substitution
In this approach, we take pre-trained word embeddings such as Word2Vec, GloVe, FastText, Sent2Vec, and use the nearest neighbor words in the embedding space as the replacement for some word in the sentence. Jiao et al. have used this technique with GloVe embeddings in their paper "TinyBert" to improve generalization of their language model on downstream tasks. Wang et al. used it to augment tweets needed to learn a topic model.

单词嵌入替换
在此方法中,我们利用Word2Vec、GloVe、FastText、Sent2Vec等预训练词嵌入,使用嵌入空间中的最近邻单词来替换句子中的某些单词。Jiao等人在论文"TinyBERT"中将该技术与GloVe嵌入结合使用,以提升其语言模型在下游任务上的泛化能力。Wang等人用它来扩增学习主题模型所需的推文。

For example, you can replace the word with the 3-most similar words and get three variations of the text.

举个例子,你可以把某个单词替换为最相似的3个单词,从而获得文本的三个变体。

It's easy to use packages like Gensim to access pre-trained word vectors and get the nearest neighbors. For example, here we find the synonyms for the word 'awesome' using word vectors trained on tweets.

使用Gensim这样的包可以很方便地访问预训练词向量并获取最近邻。例如,下面我们用在推文上训练的词向量,找出与"awesome"最相似的5个词。

# pip install gensim
import gensim.downloader as api

model = api.load('glove-twitter-25')
model.most_similar('awesome', topn=5)

You will get back the 5 most similar words along with the cosine similarities.
你将得到5个最相似的单词及对应的余弦相似度:

[('amazing', 0.9687871932983398),
 ('best', 0.9600659608840942),
 ('fun', 0.9331520795822144),
 ('fantastic', 0.9313924312591553),
 ('perfect', 0.9243415594100952)]

Masked Language Model
Transformer models such as BERT, RoBERTa and ALBERT have been trained on a large amount of text using a pretext task called "Masked Language Modeling" where the model has to predict masked words based on the context. This can be used to augment some text. For example, we could use a pre-trained BERT model and mask some parts of the text. Then, we use the BERT model to predict the token for the mask.

掩码语言模型
BERT、RoBERTa、ALBERT等Transformer模型都借助一个称为"掩码语言建模"的前置任务在大量文本上进行了训练:模型需要根据上下文预测被掩码的单词。这可以用来扩增文本。例如,我们可以拿一个预训练的BERT模型,对文本中的部分内容进行掩码,再用BERT模型预测被掩码位置的token。

Thus, we can generate variations of a text using the mask predictions. Compared to previous approaches, the generated text is more grammatically coherent as the model takes context into account when making predictions.

于是,通过掩码预测,我们就能生成文本的多种变体。与前面的方法相比,这种方法生成的文本在语法上更连贯,因为模型在预测时考虑了上下文。

This is easy to implement with open-source libraries such as transformers by Hugging Face. You can set the token you want to replace with and generate predictions.

使用Hugging Face的transformers等开源库,实现起来非常简单。你可以在想要替换的位置放上掩码token,然后生成预测。

from transformers import pipeline

nlp = pipeline('fill-mask')
nlp('This is <mask> cool')

[{'score': 0.515411913394928, 'sequence': '<s> This is pretty cool</s>', 'token': 1256},
 {'score': 0.1166248694062233, 'sequence': '<s> This is really cool</s>', 'token': 269},
 {'score': 0.07387523353099823, 'sequence': '<s> This is super cool</s>', 'token': 2422},
 {'score': 0.04272908344864845, 'sequence': '<s> This is kinda cool</s>', 'token': 24282},
 {'score': 0.034715913236141205, 'sequence': '<s> This is very cool</s>', 'token': 182}]

However one caveat of this method is that deciding which part of the text to mask is not trivial. You will have to use heuristics to decide the mask, otherwise the generated text will not retain the meaning of original sentence.

然而,此方法的一个注意点是:决定对文本的哪一部分进行掩码并非易事。你必须使用启发式方法来选择掩码位置,否则生成的文本可能无法保留原句的语义。

TF-IDF based word replacement
This augmentation method was proposed by Xie et al. in the Unsupervised Data Augmentation paper. The basic idea is that words that have low TF-IDF scores are uninformative and thus can be replaced without affecting the ground-truth labels of the sentence.

基于TF-IDF的单词替换
这种扩增方法由Xie等人在无监督数据增强(UDA)论文中提出。其基本思想是:TF-IDF得分较低的单词信息量小,因此替换它们不会影响句子的真实标签。

The words to replace with are chosen from the whole vocabulary that have low TF-IDF scores in the whole document. You can refer to the implementation in the original paper from here.

用于替换的词从整个词表中选取,要求其在全文中的TF-IDF得分同样较低。具体实现可以参阅原论文。

2. Back Translation
In this approach, we leverage machine translation to paraphrase a text while retaining the meaning. Xie et al. used this method to augment the unlabeled text and learn a semi-supervised model on IMDB dataset with only 20 labeled examples. The method outperformed the previous state-of-the-art model trained on 25,000 labeled examples.

The back-translation process is as follows:
1. Take some sentence (e.g. in English) and translate to another Language e.g. French
2. Translate the french sentence back into English sentence
3. Check if the new sentence is different from our original sentence. If it is, then we use this new sentence as an augmented version of the original text.
2. 回译
在这种方法中,我们利用机器翻译在保留原意的同时对文本进行改写。Xie等人用该方法扩增未标记文本,在只有20个带标记样本的IMDB数据集上学习了一个半监督模型,效果优于此前在25000个带标记样本上训练的最佳模型。

回译过程如下:
1. 把一个句子(如英语)翻译成另一种语言,如法语;
2. 把法语句子再翻译回英语;
3. 检查新句子是否与原句不同。如果不同,就把这个新句子作为原文的增强版本。

You can also run back-translation using different languages at once to generate more variations. As shown below, we translate an English sentence to a target language and back again to English for three target languages: French, Mandarin and Italian.

你还可以同时使用多种语言运行回译,以生成更多变体。如下图所示,我们将一个英语句子翻译成目标语言再译回英语,三种目标语言分别为法语、汉语和意大利语。

This technique was also used in the 1st place solution for the "Toxic Comment Classification Challenge" on Kaggle. The winner used it for both training-data augmentations as well as during test-time where the predicted probabilities for English sentence along with back-translation using three languages(French, German, Spanish) were averaged to get the final prediction.

这项技术也被用于Kaggle"Toxic Comment Classification Challenge"的第一名解决方案中。冠军既用它做训练数据扩增,也在测试阶段使用:将英语句子及其三种语言(法语、德语、西班牙语)回译版本的预测概率取平均,得到最终预测。

For the implementation of back-translation, you can use TextBlob. Alternatively, you can also use Google Sheets and follow the instructions given here to apply Google Translate for free.

回译的实现可以使用TextBlob;或者,你也可以使用Google Sheets,按照这里给出的说明免费调用Google翻译。

3. Text Surface Transformation
These are simple pattern matching transformations applied using regex and were introduced by Claude Coulombe in his paper. In the paper, he gives an example of transforming verbal forms from contraction to expansion and vice versa. We can generate augmented texts by applying this.

3. 文本表面转换
这类方法是用正则表达式实现的简单模式匹配变换,由Claude Coulombe在其论文中提出。论文举了动词形式在缩写与展开之间相互转换的例子,应用这种变换即可生成增强文本。

Since the transformation should not change the meaning of the sentence, we can see this can fail in case of expanding ambiguous verbal forms like:

由于变换不应改变句意,在展开有歧义的动词缩写形式时可能会失败。为了解决这个问题,论文提出:允许有歧义的缩写,但跳过有歧义的展开。

You can find a list of contractions for the English language here.

你可以在这里找到英语的缩略语列表。

4. Random Noise Injection
The idea of these methods is to inject noise in the text so that the model trained is robust to perturbations.

4. 随机噪声注入
这类方法的思想是在文本中注入噪声,使训练出的模型对扰动具有鲁棒性。

Spelling error injection
In this method, we add spelling errors to some random word in the sentence. These spelling errors can be added programmatically or using a mapping of common spelling errors such as this list for English.

拼写错误注入
在这种方法中,我们给句子里随机挑选的单词加入拼写错误。这些错误可以通过程序生成,也可以利用常见拼写错误映射表(例如这份英语拼写错误列表)来添加。

QWERTY Keyboard Error Injection
This method tries to simulate common errors that happen when typing on a QWERTY layout keyboard due to keys that are very near to each other. The errors are injected based on keyboard distance.

QWERTY键盘错误注入
此方法尝试模拟在QWERTY布局键盘上打字时,由于按键彼此相邻而产生的常见错误。错误按照键盘上的距离注入。

Unigram Noising
This method has been used by Xie et al. and also the UDA paper. The idea is to perform replacement with words sampled from the unigram frequency distribution. This frequency is basically how many times each word occurs in the training corpus.

一元(Unigram)噪声
Xie等人以及UDA论文都使用了该方法。其思想是用按一元词频分布采样得到的单词进行替换,这个频率就是每个单词在训练语料库中出现的次数。

Blank Noising
This method has been proposed by Xie et al. in their paper. The idea is to replace some random word with a placeholder token. The paper uses "_" as the placeholder token. In the paper, they use it as a way to avoid overfitting on specific contexts as well as a smoothing mechanism for the language model. The technique helped improve perplexity and BLEU scores in the paper.
空白噪声
这种方法由Xie等人在其论文中提出,思想是用占位符标记替换一些随机单词(论文使用"_"作为占位符)。他们以此避免模型对特定语境过拟合,同时作为语言模型的平滑机制。该技术帮助论文改善了困惑度(perplexity)和BLEU分数。

Sentence Shuffling
This is a naive technique where we shuffle sentences present in a training text to create an augmented version.

句子重排
这是一种朴素的技巧:打乱训练文本中句子的顺序,得到一个增强版本。

Instance Crossover Augmentation
This technique was introduced by Luque in his paper on sentiment analysis for TASS 2019. It is inspired by the chromosome crossover operation that happens in genetics. In the method, tweets are divided into two halves and two random tweets of the same polarity(i.e. positive/negative) have their halves swapped. The hypothesis is that even though the result will be ungrammatical and semantically unsound, the new text will still preserve the sentiment polarity.

实例交叉扩增
这项技术由Luque在其关于TASS 2019情感分析的论文中提出,灵感来自遗传学中的染色体交叉操作。该方法把推文分成两半,让两条极性相同(同为正面或同为负面)的随机推文交换各自的一半。其假设是:即便结果不合语法、语义上也不通顺,新文本仍会保留原有的情感极性。

This technique had no impact on the accuracy but helped with the F1 score in the paper showing its impact on minority classes such as Neutral class with fewer tweets.

这项技术对准确率没有影响,但提升了论文中的F1分数,表明它对推文数量较少的少数类(如中性类)有帮助。

Syntax-tree Manipulation
This technique has been used in the paper by Coulombe. The idea is to parse and generate the dependency tree of the original sentence, transform it using rules and generate a paraphrased sentence. For example, one transformation that doesn't change the meaning of the sentence is the transformation from active voice to the passive voice of sentence and vice versa.

语法树操作
Coulombe的论文也使用了这一技术。其思想是解析并生成原句的依存树,利用规则对其进行变换,生成一个复述句。例如,主动语态与被动语态之间的互换就是一种不改变句意的变换。

MixUp for Text
Mixup is a simple yet effective image augmentation technique introduced by Zhang et al. in 2017. The idea is to combine two random images in a mini-batch in some proportion to generate synthetic examples for training. For images, this means combining image pixels. The method acts as a form of regularization.

文本混合(MixUp for Text)
Mixup是Zhang等人在2017年提出的一种简单而有效的图像增强技术,其思想是在mini-batch内将两幅随机图像按一定比例组合,生成用于训练的合成样本。对图像来说,这意味着按比例混合像素。该方法可视为一种正则化。

Bringing this idea to NLP, Guo et al. modified Mixup to work with text. They propose two novel approaches for applying Mixup to text:

wordMixup: In this method, two random sentences in a mini-batch are taken and they are zero-padded to the same length. Then, their word embeddings are combined in some proportion. The resulting word embedding is passed to the usual flow for text classification. The cross-entropy loss is calculated for both the labels of the original text in the given proportion.

Guo等人把这一思想带入NLP,将Mixup改造为适用于文本的方法,提出了两种新做法:

单词混合(wordMixup):在mini-batch中取两个随机句子,将它们零填充到相同长度,然后按一定比例组合它们的词嵌入;得到的词嵌入送入常规的文本分类流程,并按同一比例对两个原句的标签分别计算交叉熵损失。

sentMixup: In this method, two sentences are taken and they are zero-padded to the same length. Then, their word embeddings are passed through LSTM/CNN encoder and we take the last hidden state as sentence embedding. These embeddings are combined in a certain proportion and then passed to the final classification layer. The cross-entropy loss is calculated based on both the labels of original sentences in the given proportion.
句子混合(sentMixup):取两个句子并零填充到相同长度,然后把它们的词嵌入送入LSTM/CNN编码器,取最后的隐藏状态作为句子嵌入;这些句子嵌入按一定比例组合后送入最终分类层,交叉熵损失按同一比例对两个原句的标签分别计算。

Implementation
Python libraries like nlpaug and textattack provide simple and consistent API to apply the above methods. They are framework agnostic and can be easily integrated into your pipeline.

实现
nlpaug和textattack这样的Python库提供了简单而一致的API来应用上述方法。它们与框架无关,可以很容易地集成到你的pipeline中。

Conclusion
My takeaway from the literature review is that many of these augmentation methods are very task-specific and their impact on performance has been studied for some particular use-cases only. It would be an interesting research to systematically compare these methods and analyze their impact on performance for many tasks.

结论
我从文献综述中得到的结论是:这些扩增方法大多与具体任务高度相关,它们对性能的影响只在某些特定用例上被研究过。系统地比较这些方法,并分析它们在多种任务上对性能的影响,会是一项有趣的研究。

Citation Info (BibTex)
If you found this blog post useful, please consider citing it as:

引文信息(BibTex)
如果你觉得这篇博文有用,请考虑引用它:

@misc{chaudhary2020nlpaugment,
  title  = {A Visual Survey of Data Augmentation in NLP},
  author = {Amit Chaudhary},
  year   = 2020,
  note   = {\url{https://amitness.com/2020/05/data-augmentation-for-nlp}}
}

参考文献
Qizhe Xie, et al. "Unsupervised Data Augmentation for Consistency Training"
Claude Coulombe "Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs"
Xiaoqi Jiao, et al. "TinyBERT: Distilling BERT for Natural Language Understanding"
Xiang Zhang, et al. "Character-level Convolutional Networks for Text Classification"
Franco M. Luque "Atalaya at TASS 2019: Data Augmentation and Robust Embeddings for Sentiment Analysis"
Ziang Xie, et al. "Data Noising as Smoothing in Neural Network Language Models"
Hongyu Guo, et al. "Augmenting Data with Mixup for Sentence Classification: An Empirical Study"
Hongyi Zhang, et al. "mixup: Beyond Empirical Risk Minimization"
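附:下面是笔者补充的一个最小示意(非原文代码),演示文中"基于同义词库的替换":用NLTK的WordNet随机替换句子中的一个词。首次运行需要先下载wordnet语料。

import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # 首次运行需下载语料

def synonym_replace(sentence: str) -> str:
    words = sentence.split()
    # 找出所有在 WordNet 中能查到同义词的位置
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    if not candidates:
        return sentence
    i = random.choice(candidates)
    lemmas = {l.name().replace("_", " ")
              for s in wordnet.synsets(words[i]) for l in s.lemmas()}
    lemmas.discard(words[i])
    if lemmas:
        words[i] = random.choice(sorted(lemmas))
    return " ".join(words)

print(synonym_replace("The quick brown fox jumps over the lazy dog"))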

2

自然语言处理

AI小助手·2020-11-11 14:10 2 阅读 84