SLaK: Scaling the convolution kernel to 51×51 from a sparsity perspective

In this paper, the authors propose a sparsity-motivated method for training convolutions with very large kernels, which scales the Kernel smoothly up to 61×61 while improving performance. Based on this method, the authors build the Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51×51 convolutional kernels.

1 SLaK: Scaling up convolutional kernels to 51 × 51 from a sparsity perspective

Paper title: More ConvNets in the 2020s: Scaling up Kernels Beyond 51 × 51 using Sparsity

Paper: https://arxiv.org/abs/2207.03620

Code: https://github.com/VITA-Group/SLaK

1.1 SLaK Principle Analysis

The background of this paper is the recent emergence of convolutional neural networks with very large kernels, an idea inspired by the way ViT models capture global information. One representative work is RepLKNet, which enlarges the convolutional kernel to 31×31 and obtains classification performance comparable to Swin Transformer, along with better performance on downstream tasks. This paper explores the possibility of training kernels even larger than 31×31. The authors find that naively enlarging the kernel leads to performance saturation, and therefore investigate whether the remaining performance gap can be closed by scaling up convolutions more strategically. After extensive exploration, they propose a sparsity-based method for applying very large Kernels that scales smoothly up to 61×61 while improving performance. The resulting model is named the Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51×51 convolutional kernels.

1.1.1 Background and motivation

In modern computer vision, general-purpose vision models were first dominated by CNNs built from deep stacks of small kernels. Since the advent of ViTs, the importance of modeling global information has gradually been recognized: the Self-attention modules that form the building blocks of ViTs were found to be capable of modeling global information [1]. A large body of work has also shown that models can achieve excellent performance even when the Token-Mixer is not designed in the Query-Key-Value form. It is therefore believed that the key to the performance gains lies in Self-attention's mode of operation, which works at a global scale or within larger windows: each output of a single Self-attention layer can collect information from a relatively large region. In contrast to the small sliding windows (3×3 or 5×5) with shared weights used by CNNs, global attention, or local attention with larger window sizes, directly enables each ViT layer to capture a large receptive field.

Inspired by this trend, some recent CNN works (RepLKNet [2], ConvNeXt [3]) have obtained results comparable to Swin Transformer by designing larger Kernels. However, even with the structural reparameterization technique, which uses parallel small-Kernel branches to aid training, large Kernels remain difficult to train. The authors find that the performance of RepLKNet gradually saturates as the Kernel size keeps increasing. Whether Transformer-based models can be outperformed by scaling the Kernel size beyond 31×31 remains an open question, so this paper explores whether the performance gap can be eliminated by expanding the convolutions strategically.

Specifically, the authors explore this question from the perspective of sparsity. Sparsity is an important property of the primary visual cortex (V1) in the human visual system, where incoming stimuli can be assumed to be sparsely coded and selected. The authors extensively investigate the trainability of large Kernels and report three main observations, listed below.

Neither direct training of large-Kernel models nor auxiliary structural reparameterization techniques can scale the kernel size beyond 31×31.

Replacing a large M×M Kernel with two rectangular parallel kernels (M×N and N×M, where N < M) smoothly scales the kernel size to 61×61 and improves performance.

Using a sparse approach while increasing the width of the model can significantly improve the performance again.

Based on these observations, this paper proposes the Sparse Large Kernel Network (SLaK), a new pure CNN architecture equipped with an unprecedentedly large 51×51 Kernel. Evaluated on a variety of tasks, including ImageNet image classification, ADE20K semantic segmentation, and object detection on PASCAL VOC 2007, SLaK achieves higher accuracy than SOTA CNNs (e.g., ConvNeXt) and SOTA Transformers (e.g., Swin). Effective receptive field (ERF) analysis also shows that SLaK covers a larger ERF region than existing models while introducing a more human-like peripheral shape bias.

1.1.2 Dynamic sparsification technique

Dynamic sparsification is a technique for training sparse neural networks from scratch. Whereas post-training pruning first trains a large dense model and then prunes its parameters, in dynamic sparsification the model is sparse from the beginning, so the FLOPs and memory requirements for both training and inference are only a fraction of those of a dense model, and no pre-training is involved.

As shown in Figure 1 below, the dynamic sparsification technique derives from the Sparse Evolutionary Training (SET) method, which first randomly initializes sparse connections between layers and then dynamically adjusts them during training through a prune-and-grow scheme on the parameters (a minimal sketch of one such update step is given after Figure 1). This scheme lets the sparse structure of the model evolve gradually, giving better performance than simply training a static sparse network.

Figure 1: Dynamic sparsification technique
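To make the prune-and-grow cycle concrete, here is a minimal PyTorch-style sketch of one such update step. It is an illustration only: the function name, the magnitude-based pruning criterion, and the random regrowth are assumptions and may differ from the exact schedule and criteria used by SET or SLaK.

```python
import torch

def prune_and_grow(weight: torch.Tensor, mask: torch.Tensor, update_fraction: float = 0.3) -> torch.Tensor:
    """One illustrative prune-and-grow step: drop a fraction of the weakest active
    weights, then randomly re-activate the same number of inactive positions."""
    with torch.no_grad():
        active = mask.bool()
        n_active = int(active.sum())
        n_update = int(update_fraction * n_active)
        if n_update == 0:
            return mask

        # Prune: zero out the n_update active weights with the smallest magnitude.
        magnitudes = weight.abs().masked_fill(~active, float("inf"))
        drop_idx = torch.topk(magnitudes.flatten(), n_update, largest=False).indices
        mask.view(-1)[drop_idx] = 0.0
        weight.view(-1)[drop_idx] = 0.0

        # Grow: randomly re-activate the same number of currently inactive positions.
        # Newly grown weights start at zero and are trained from this point on.
        inactive_idx = (mask.view(-1) == 0).nonzero(as_tuple=True)[0]
        perm = torch.randperm(len(inactive_idx), device=inactive_idx.device)
        mask.view(-1)[inactive_idx[perm[:n_update]]] = 1.0

    return mask
```

In a dynamic-sparse training loop, a step like this would be applied to each sparse kernel every few hundred iterations, with the mask re-applied to the weights after every optimizer update so the overall sparsity level stays constant.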

1.1.3 Three observations on scaling the size of the convolution kernel to exceed 31×31

The authors first investigate the performance of extreme Kernels larger than 31×31 and share three main observations. The recent SOTA CNN architecture ConvNeXt and the ImageNet-1K dataset are used as the benchmark.

The authors follow the common training recipe for backbone models: data augmentation with Mixup, CutMix, RandAugment, and Random Erasing; regularization with Stochastic Depth and Label Smoothing; and AdamW as the optimizer. Analytical experiments are trained for 120 epochs so the effect of each design choice can be observed quickly, while the formal experiments are trained for 300 epochs to allow a fair comparison with state-of-the-art models.

Observation 1: Existing methods (structural reparameterization techniques) cannot further scale the Kernel size from 31×31 upwards

RepLKNet successfully scales convolutions to 31×31 via structural reparameterization, allowing the model to achieve performance comparable to Swin Transformer. The authors of this paper further increase the kernel size to 51×51 and 61×61 to see whether a larger kernel brings additional gains. Following the design in RepLKNet, they set the Kernel sizes of the four Stages to [51, 49, 47, 13] and then [61, 59, 57, 13]; the results are shown in Figure 2 below. Naively increasing the Kernel size from 7×7 to 31×31 degrades performance noticeably, whereas RepLKNet overcomes this problem and improves accuracy by 0.5%. However, this trend does not hold for larger kernels: pushing the Kernel size to 51×51 starts to hurt performance.

Figure 2: Test accuracy of ConvNeXt-T trained on ImageNet-1K with various large Kernels, naive means directly increasing the size of the Kernel and RepLKNet means using structural reparameterization techniques

A plausible explanation for this phenomenon is that after the convolutional kernel is enlarged to 51×51 or 61×61, the receptive field of the model grows, but the model may no longer retain certain desirable properties, such as locality. Since the Stem module of standard ResNet and ConvNeXt already downsamples the input by a factor of 4, a very large Kernel such as 51×51 is essentially equivalent to a global convolution (for 224×224 ImageNet inputs). The observation is therefore consistent with what is seen in ViTs, where local attention (e.g., Swin) usually outperforms global attention (e.g., DeiT). Inspired by this, the authors want to address the issue with the help of locality while keeping the model's ability to capture global relations.

Observation 2: Decomposing a large square Kernel into two rectangular, parallel Kernels can smoothly scale the Kernel size to 61

The authors’ approach is to approximate the very large M×M Kernel with two parallel rectangular convolutions with Kernel sizes M×N and N×M (where N < M), as shown in Figure 3 below. This can be viewed as a modified RepLKNet in which the single large-Kernel branch is replaced by two rectangular branches; there is additionally a 5×5 branch, and the outputs of the three branches are merged after their BN layers.

Figure 3: Approximation of the very large M×M Kernel with two parallel rectangular convolutions, which have Kernel sizes of M×N and N×M (where N < M), respectively
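Below is a minimal PyTorch sketch of the decomposition idea: a depthwise M×N branch, a depthwise N×M branch, and a 5×5 branch, each followed by BatchNorm and then summed. The module name and exact layer arrangement are assumptions for illustration; the official SLaK repository contains the actual implementation.

```python
import torch
import torch.nn as nn

class DecomposedLargeKernelConv(nn.Module):
    """Approximate a large M x M depthwise conv with two rectangular depthwise convs
    (M x N and N x M, N < M) plus a 5 x 5 branch, each followed by BatchNorm and summed.
    Illustrative sketch, not the official SLaK module."""

    def __init__(self, dim: int, m: int = 51, n: int = 5):
        super().__init__()
        # M x N depthwise branch (tall, narrow kernel)
        self.conv_mn = nn.Conv2d(dim, dim, kernel_size=(m, n), padding=(m // 2, n // 2), groups=dim)
        # N x M depthwise branch (short, wide kernel)
        self.conv_nm = nn.Conv2d(dim, dim, kernel_size=(n, m), padding=(n // 2, m // 2), groups=dim)
        # Small 5 x 5 depthwise branch for local context
        self.conv_small = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.bn_mn = nn.BatchNorm2d(dim)
        self.bn_nm = nn.BatchNorm2d(dim)
        self.bn_small = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Merge the three branches after their BN layers.
        return self.bn_mn(self.conv_mn(x)) + self.bn_nm(self.conv_nm(x)) + self.bn_small(self.conv_small(x))

# Example: a 51x51 "coverage" on a feature map of shape (1, 96, 56, 56)
# y = DecomposedLargeKernelConv(96)(torch.randn(1, 96, 56, 56))
```

Because all three branches use stride 1 and "same"-style padding, their outputs have identical spatial shape and can simply be added, while the parameter count grows only linearly with M instead of quadratically.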

This decomposition not only inherits the large Kernel's ability to capture long-range relationships, but its shorter edge also extracts local contextual features. In addition, as shown in Figure 4 below, the existing large-Kernel training technique (RepLKNet) incurs computational and memory overhead that grows quadratically with the Kernel size. The setting with N = 5 is denoted as Decomposed; since it no longer contains a full 31×31 Kernel, Decomposed sacrifices some accuracy compared with the structurally reparameterized 31×31 model. However, as the convolution grows toward a global convolution, the decomposition can, surprisingly, scale the Kernel size up to 61 while improving performance.

Figure 4: Scaling efficiency of various large-Kernel training methods when applied to ConvNeXt-T

Observation 3: Using more sparse Groups can increase the capacity of the model

The convolution design principle of ConvNeXt is to use Depth-wise Convolution while increasing the width, which can be summarized as “use more groups, expand width”. The design idea of this paper can instead be summarized as “use sparse groups, expand more”. Specifically, the authors first replace the dense convolutions with sparse convolutions, where the sparse kernels are constructed based on SNIP. After construction, the sparse model is trained with the dynamic sparsification technique, in which the sparse weights are adjusted dynamically during training: a random portion is pruned, and then the same number of weights are randomly regrown. This dynamic adaptation of the sparse weights yields better local features. Since the kernels remain sparse throughout training, the corresponding parameter count and training/inference FLOPs stay small. The authors use 40% sparsity, and the result is denoted as Sparse Decomposed.

Figure 5: Test accuracy of ConvNeXt under different experimental settings

The test accuracy of ConvNeXt under these settings is shown in Figure 5 above. As can be seen from the second column, the parameter count and computation of the model drop significantly at 40% sparsity, at the cost of a temporary decrease in performance. However, dynamic sparsity improves the scalability of the model: with the same 40% sparsity, the model width can be scaled up by a factor of 1.3 while keeping the parameter count and FLOPs roughly the same as the dense model. This yields a significant performance improvement, from 81.3% to 81.6% with the 51×51 Kernel.
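A back-of-the-envelope calculation makes this plausible, under the assumption that the dominant layers' parameter counts grow quadratically with the width w (as for the 1×1 convolutions in ConvNeXt), with c collecting layer-dependent constants:

```latex
P_{\text{dense}} \approx c\,w^{2}, \qquad
P_{\text{sparse}} \approx (1 - 0.4)\,c\,(1.3\,w)^{2}
  = 0.6 \times 1.69 \; c\,w^{2}
  \approx 1.01 \; c\,w^{2}
  \approx P_{\text{dense}}
```

In other words, removing 40% of the weights roughly pays for a 1.3× increase in width, which is exactly the tradeoff the table reports.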

1.1.4 Sparse Large Kernel Networks: SLaK

So far, the Kernel size has been successfully scaled to 61 without sacrificing model performance. The scaling method consists of two sparsity-inspired designs. At the macro level, a sparse model is constructed and its capacity is increased through the dynamic sparsification technique. At the micro level, an oversized Kernel is decomposed into two complementary dynamically sparse Kernels to improve its scalability. The authors train SLaK directly from scratch, without any pre-training or fine-tuning.

SLaK is built on the ConvNeXt architecture and inherits its Stem design. The number of blocks per Stage is [3, 3, 9, 3] for SLaK-T and [3, 3, 27, 3] for SLaK-S/B, and the Stem is a convolutional layer with kernel size and stride both equal to 4. The authors increase the Kernel sizes of the Stages of ConvNeXt to [51, 49, 47, 13] and replace each M×M Kernel with a combination of M×5 and 5×M Kernels. They found it necessary to add a BN layer after the M×5 and 5×M branches before summing them. Following the principle of using more sparse groups, they further sparsify the whole network and extend the width of each Stage by a factor of 1.3, finally obtaining the SLaK-T/S/B models (summarized in the sketch below).
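For reference, the stage-level design described above can be condensed into a small, purely illustrative Python snippet; the field names are assumptions for readability, not the configuration schema used by the official code.

```python
# Illustrative summary of SLaK-T as described in the text (not the official config format).
slak_t = {
    "depths": [3, 3, 9, 3],            # blocks per Stage (SLaK-S/B use [3, 3, 27, 3])
    "kernel_sizes": [51, 49, 47, 13],  # large Kernel size M per Stage
    "decompose_n": 5,                  # each MxM Kernel becomes Mx5 + 5xM, each followed by BN, then summed
    "width_factor": 1.3,               # Stage widths scaled by 1.3x relative to ConvNeXt-T
    "sparsity": 0.4,                   # 40% of kernel weights inactive, adjusted dynamically during training
    "stem": {"kernel_size": 4, "stride": 4},
}
```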

1.1.5 SLaK experimental results

Image classification experimental setup: ImageNet dataset, 300 epochs, AdamW optimizer, batch size 4096, weight decay 0.05, learning rate 4e-3 with a 20-epoch linear warmup and a cosine decay schedule, data augmentation with RandAugment (rand-m9-mstd0.5-inc1), Label Smoothing (coefficient 0.1), Mixup (α = 0.8), CutMix (α = 1.0), Random Erasing (p = 0.25), Stochastic Depth (drop rate 0.1 for SLaK-T, 0.4 for SLaK-S, 0.5 for SLaK-B), and EMA (decay factor 0.9999).

Semantic segmentation experimental setup: ADE20K dataset, ImageNet pre-trained backbone, UperNet semantic segmentation model, 80K training iterations, single-scale mIoU reported at test time.

Object detection experimental setup: PASCAL VOC dataset, ImageNet pre-trained backbone, Faster R-CNN detection model, 36 training epochs, following Swin.

ImageNet Experimental Results

As shown in Figure 6 below, SLaK outperforms existing convolutional models such as ResNe(X)t, RepLKNet, and ConvNeXt with similar parameter counts and FLOPs. Without using any complex self-attention modules or patch embeddings, SLaK achieves better results than state-of-the-art vision transformers (Swin, PVT, etc.). More interestingly, directly replacing the 7×7 Kernel of ConvNeXt-S with the 51×51 Kernel improves accuracy by 0.7%.

Figure 6: ImageNet experimental results

Semantic segmentation experimental results

The results are shown in Figure 7 below, and a clear trend can be seen: performance increases as the Kernel size increases. RepLKNet scales the Kernel size of ConvNeXt-T from 7×7 to 31×31, improving mIoU by 1.6%. Notably, SLaK-T with an even larger Kernel (51×51) improves mIoU by a further 0.9% over ConvNeXt-T with 31×31 kernels (RepLKNet), while requiring fewer FLOPs.

Figure 7: Semantic segmentation experimental results

Object Detection Experimental Results

Figure 8 below compares SLaK-T, ConvNeXt-T, RepLKNet, and a traditional convolutional network (ResNet). Again, a large Kernel leads to better performance: ConvNeXt-T with a 31×31 Kernel achieves 0.7% higher mean average precision (mAP) than with a 7×7 Kernel, and SLaK-T with a 51×51 Kernel brings a further 1.4% mAP improvement, highlighting the critical role of very large Kernels in downstream vision tasks.

Figure 8: Object detection experimental results

1.1.6 Other discussions of SLaK

CNNs with shallow stacks of large Kernels have a larger effective receptive field than CNNs with deep stacks of small Kernels

In the original paper, the RepLKNet authors discussed the effective receptive field of several network models and concluded that a single large Kernel is much more effective than many small Kernels at obtaining a large effective receptive field. According to effective receptive field (ERF) theory, the size of the ERF is proportional to K√L, where K is the size of the convolutional kernel and L is the depth, i.e., the number of layers. In other words, the ERF grows linearly with the Kernel size and only sublinearly with the depth.
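Written out, with K the kernel size and L the depth, the scaling result the paragraph refers to is:

```latex
\text{ERF} \;\propto\; K\sqrt{L}
```

So doubling the kernel size doubles the ERF, while doubling the depth only enlarges it by a factor of √2 ≈ 1.41, which is why a shallow network with large Kernels can reach a larger ERF than a much deeper network with small Kernels.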

The assumption behind the Kernel decomposition in SLaK is therefore that two independent M×N and N×M Kernels can maintain the large Kernel's ability to capture a large effective receptive field, while the short edge (N) of each kernel also helps capture fine-grained local features. To evaluate this hypothesis, the authors compare the ERFs captured by SLaK and RepLKNet.

The authors select 50 images from the validation set, resize them to 1024×1024, measure the contribution of each pixel of the input image to the central point of the feature map generated by the last layer, and aggregate the results into a 1024×1024 matrix. The resulting effective receptive fields are shown in Figure 9 below. The visualization of the effective receptive field is computed as follows.

Let X denote the input image and F denote the final output feature map. We want to measure the contribution of each pixel of the input image to the central position of the final output feature, which can be obtained by differentiating with respect to the input through the autograd mechanism. Formally, writing s for the sum of F over channels at the central spatial position, the score matrix A is given by A(i, j) = Σ_c |∂s / ∂X_c(i, j)|, i.e., the gradient magnitude of the central response with respect to input pixel (i, j), aggregated over the input channels.

Finally, the score matrix is rescaled to the range 0-1. In short, the score matrix measures the contribution of each pixel of the input image to the central point of the feature map generated by the last layer. As shown in Figure 9 below, the more widely the dark (high-contribution) regions are spread, the larger the effective receptive field (ERF). It was found that adding more layers (e.g., going from ResNet-101 to ResNet-152) does little to expand the ERF; in contrast, the shallower large-Kernel models have very large effective receptive fields.
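A minimal PyTorch sketch of this measurement might look as follows. The `forward_features` hook, the use of absolute gradients, the channel aggregation, and the min-max rescaling are assumptions for illustration and may differ in detail from the authors' script.

```python
import torch

def erf_score_matrix(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Contribution of each input pixel to the central position of the last feature map,
    obtained via autograd (illustrative sketch)."""
    image = image.detach().clone().requires_grad_(True)   # shape (1, 3, 1024, 1024)
    features = model.forward_features(image)               # assumed to return the last feature map
    h, w = features.shape[-2:]
    center = features[..., h // 2, w // 2].sum()            # sum over channels at the central position
    center.backward()
    score = image.grad.abs().sum(dim=1).squeeze(0)          # aggregate gradient magnitude over RGB channels
    score = (score - score.min()) / (score.max() - score.min() + 1e-12)  # rescale to [0, 1]
    return score                                            # (1024, 1024); averaged over 50 images in the paper
```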

Figure 9: Effective receptive fields of ConvNeXt, RepLKNet, and SLaK, and the Kernel sizes used

As can be seen in Figure 9, although the original ConvNeXt increases the Kernel size to 7×7, the input pixels that contribute strongly to the central output appear only in the central part of the image. For RepLKNet, even a 31×31 Kernel is not enough for its effective receptive field to cover the entire input. In contrast, the high-contribution pixels of SLaK are distributed over a much larger range of the input, indicating a larger ERF. Moreover, SLaK's receptive field shows alternating light and dark regions, with the central region brighter and the peripheral region somewhat darker. This finding is fully consistent with the hypothesis that SLaK captures not only long-range dependencies but also local features.

The authors also perform a quantitative analysis: given a threshold t, they report the area ratio of the smallest rectangle whose covered contribution scores reach the threshold t, as shown in Figure 10 below. For example, for ResNet-101, the central 102×102 area reaches the 20% threshold, so the corresponding area ratio is (102/1024)² ≈ 1.0%. A larger ratio means the model takes a wider range of pixels into account when making decisions. As can be seen, with its near-global Kernel, SLaK naturally has a higher area ratio than ConvNeXt and RepLKNet.

Figure 10: Quantitative analysis of the ERF

Using more sparse Groups can improve the capacity of the model

This relates to the paper's third observation: using more sparse Groups can improve the capacity of the model. To keep the parameter count and computation roughly constant, the width must be smaller when the sparsity of the model is low and can be larger when the sparsity is high, giving rise to a Sparsity-Width tradeoff. To better understand this tradeoff, the authors chose five Sparsity-Width combinations, all with approximately 5.0 GFLOPs but different network widths, and ran the experiments on SLaK-T. As the authors expected, the performance of the model keeps improving as the width increases, up to a width factor of 1.5×.

Beyond that point, as the sparsity increases further, the model becomes highly sparse and difficult to train, so increasing the width further starts to hurt performance.

Figure 11: The Sparsity-Width tradeoff

Summary

This paper finds that continuously increasing the convolutional kernel size leads to performance saturation, and therefore explores whether the performance gap can be eliminated by expanding the convolutions strategically. After extensive exploration, the authors propose a sparsity-based method for applying very large Kernels that scales smoothly up to 61×61 with improved performance, and they name the resulting model the Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51×51 convolutional kernels. The main strategies are to decompose a large square Kernel into two rectangular, parallel Kernels and to use more sparse Groups while increasing the width of the model.
