Transfer learning with CNNs has been widely used in computer vision tasks such as image classification and pattern detection. Such models are based on specific architectures that have solved similar problems, and pretrained weights are loaded for faster training as well as better performance. However, a majority of these models are designed for color images with three input channels, so the pretrained weights contain redundant parts when the inputs are grayscale images. In this project, we present a method for removing superfluous filters and their corresponding weights from convolutional layers in the grayscale setting. Given a trained color model, this process can simplify the CNN architecture, reduce the number of model parameters, and significantly shorten the required training time, at a very low cost in classification accuracy.
1 Preface and Introduction
Imagine that we see the object shown above, red with a tuft of green. We can immediately recognize it as a tomato. This is because the cones in our retina bring in the red and green colors, and the rods help us build the smooth, round shape. Our brain then processes this color and pattern information at a higher level of cognition. The cones lose their ability to respond to light when it gets darker. In such an environment, however, we can still recognize the tomato as before, because the rods continue to let us see its shape and brightness. Inspired by this, we believe a neural network can do the same: even if we remove the color information from the original model, it can still retain high classification performance.
In image recognition, there are many grayscale pictures, such as chest radiographs and handwritten characters. Currently, the usual approach to recognizing such images is to apply transfer learning models that were designed for color pictures, mostly relying on weights pretrained on ImageNet. This leaves many redundant parameters in the model. This project aims to find a methodology for modifying models trained on color images so that they can be applied to grayscale images.
2 Dataset and Models
In order to simplify the training process, this project used the well-known CIFAR-10 dataset. It consists of 60,000 32 x 32 pixels color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. Figure 1 shows the CIFAR-10 dataset images. While the left grid shows nine original images, the right grid shows the same nine images but transformed to grayscale using linear weights applied to the RGB channels.
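The grayscale conversion described above can be sketched in a few lines of NumPy. The article only states that linear weights are applied to the RGB channels; the Rec. 601 luma coefficients used here, the function name `rgb_to_gray`, and the random toy batch are assumptions for illustration.

```python
import numpy as np

# Rec. 601 luma weights. This is an assumption: the article only says
# "linear weights applied to the RGB channels".
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def rgb_to_gray(images):
    """Convert (..., H, W, 3) RGB arrays to (..., H, W, 1) grayscale arrays."""
    gray = images.astype(np.float32) @ LUMA_WEIGHTS  # weighted sum over the channel axis
    return gray[..., np.newaxis]                     # keep an explicit channel axis

# Toy batch standing in for CIFAR-10: four random 32 x 32 RGB images.
batch = np.random.default_rng(0).integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
gray_batch = rgb_to_gray(batch)
print(gray_batch.shape)
```

Because the weights sum to one, a pure white pixel stays at full brightness after conversion, so the grayscale images keep the same value range as the originals.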
2.2 Model Selection
We first want to verify that transfer learning can be applied to grayscale images. For model selection, we investigated the classification accuracy of ResNet and a customized DenseNet on both the color and grayscale CIFAR-10 datasets.
Initially, both the color and grayscale images were trained using ResNet from scratch. Here, the grayscale images were fed to the model by replicating the single channel three times so that the input fits the three-channel model. Then, pretrained weights were applied to both training processes. Finally, the first convolutional layer of the ResNet was modified to take only a single grayscale channel as input. All models were trained for 50 epochs using the same loss function and optimizer. The training results are shown in Table 1.
Table 1 ResNet Results on Color and Grayscale CIFAR-10
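The two ways of feeding grayscale input to a color model mentioned above (tripling the channel versus a single-channel first layer) are closely related. Because convolution is linear in its input, three identical channels passed through a color filter give the same response as one channel passed through a filter whose weights are summed over the RGB axis. The sketch below checks this numerically. Whether the single-channel layers in this project were initialised this way is not stated, so treat it as one plausible option; the `(kh, kw, in, out)` weight layout and the helper names are assumptions.

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' cross-correlation. x: (H, W, C_in), w: (kh, kw, C_in, C_out)."""
    kh, kw, _, c_out = w.shape
    H, W, _ = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]
            # Contract the patch with every filter at once.
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def collapse_rgb_weights(w_rgb):
    """Sum the three input-channel slices so the layer accepts one gray channel."""
    return w_rgb.sum(axis=2, keepdims=True)

rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(3, 3, 3, 36))   # 36 filters, as in the first layer discussed here
gray = rng.random((8, 8, 1))             # toy single-channel image
tripled = np.repeat(gray, 3, axis=2)     # the "triple the same information" input

out_tripled = conv2d_valid(tripled, w_rgb)
out_single = conv2d_valid(gray, collapse_rgb_weights(w_rgb))
print(np.allclose(out_tripled, out_single))
```

The single-channel layer has one third of the first-layer weights of the color layer while producing identical activations for gray input.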
Initially, the color images were trained using DenseNet without pretrained weights for 50 epochs. Then, the same model was applied to grayscale images for the same number of epochs, using the same loss function and optimizer. The validation accuracy dropped by 3%, as shown in Table 2. Since our modified DenseNet outperformed ResNet at 50 epochs, we used DenseNet for the subsequent filter clustering and compression. This DenseNet model was further trained for 190 epochs, reaching an accuracy of 91.16%. It serves as the pretrained model on which we carry out the transformation from a color model to a grayscale model by compressing the kernels. The first convolutional layer was modified to take only a single grayscale channel as input, and the baseline accuracy with this grayscale input is 83.2%. How the layer weights are modified is described in the following sections.
Table 2 DenseNet Results on Color and Grayscale CIFAR-10
3.1 Receptive Fields Visualization of Color Model Layers
To start transforming the original model into a grayscale model, we started from the pretrained weights of the 190-epoch DenseNet baseline model mentioned above. The first step is to identify the receptive field of each filter in the convolutional layers, which here refers to the input image pattern that maximally activates one filter. To find the receptive field of each filter, we used an algorithm called “gradient ascent in input space” to visualize it. The process is shown in Figure 3. Starting from noise, we apply gradient ascent to the input image of a convolutional layer to maximize the response of a specific filter. The resulting visualization shows the image that activates each filter the most. Figure 4 shows an example of the color-input receptive fields in the first convolutional layer. Each square represents one of the 36 filters in layer 1.
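As a minimal sketch of gradient ascent in input space, the example below maximises the mean response of a single linear convolution filter, for which the gradient can be written by hand. For a real DenseNet the gradient would come from the framework's automatic differentiation and would pass through the preceding layers; the function names, step count, learning rate, and gradient normalisation here are assumptions, not the settings used in this project.

```python
import numpy as np

def filter_response(x, w):
    """Mean activation of one filter w (kh, kw, C) over input x (H, W, C), valid padding."""
    kh, kw, _ = w.shape
    H, W, _ = x.shape
    total = 0.0
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            total += np.sum(x[i:i + kh, j:j + kw, :] * w)
    return total / ((H - kh + 1) * (W - kw + 1))

def response_gradient(x, w):
    """Gradient of filter_response with respect to x (analytic, since the filter is linear)."""
    kh, kw, _ = w.shape
    H, W, _ = x.shape
    grad = np.zeros_like(x)
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            grad[i:i + kh, j:j + kw, :] += w
    return grad / ((H - kh + 1) * (W - kw + 1))

def gradient_ascent(x0, w, steps=20, lr=1.0):
    """Repeatedly move the input in the direction that increases the filter response."""
    x = x0.copy()
    for _ in range(steps):
        g = response_gradient(x, w)
        g /= np.sqrt(np.mean(g ** 2)) + 1e-8  # normalise so the step size is stable
        x += lr * g
    return x

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3, 3))                # one hypothetical 3 x 3 RGB filter
x0 = rng.normal(0.0, 0.1, size=(32, 32, 3))   # start from noise
x_max = gradient_ascent(x0, w)
print(filter_response(x0, w), filter_response(x_max, w))
```

The optimised input `x_max` plays the role of one square in the receptive field figures: it is the image that this filter responds to most strongly under the chosen step budget.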
Figure 4: The 36 receptive fields of the first convolutional layer. For example: the first filter is sensitive to green and the 7th filter below it is mostly responsive to dark blue.
Since the input images are grayscale, these receptive fields were converted to grayscale as well by taking a weighted average over the RGB channels. This is consistent with the formula used to convert color images to grayscale during image preprocessing. Figure 5 shows the grayscale receptive fields, representing the maximally activating grayscale images. To increase the contrast for better visualization, the grayscale receptive fields are rendered in blue and yellow (pseudocolor). The visualization shows that the first convolutional layer only separates colors and brightness, while passing all of the patterns on to the following layers.
Figure 5: The 36 grayscale receptive fields of the first convolutional layer. Compared to the color receptive fields in Figure 4, these receptive fields let us ignore the color information that is fed into these filters.
Figure 6 and Figure 7 further show the receptive fields of the 2nd and 14th layers in both forms: original color (left) and pseudocolor grayscale (right). The deeper the layer, the fewer filters are similar to each other. The 14th convolutional layer is at the beginning of the second dense block. As shown in Figure 7, most of its filters have different shape patterns. Clustering them would very likely cause significant information loss, so this project tentatively clusters and merges the filters in the first two convolutional layers only.
Figure 6: Receptive Fields of the 2nd convolutional layer
Figure 7: Receptive Fields of the 14th convolutional layer
3.2 Clustering Filters
To delete filters in charge of color information and reduce the number of filters focusing on similar patterns, we need to cluster them into groups. To represent the filters, we continue to use their receptive fields. In this way each filter is represented as a 32 by 32 matrix, which is flattened to one dimension and treated as a single observation. We then calculate the distance between each pair of these images as a measure of their similarity.
For the first convolutional layer, which focuses entirely on brightness, we used the mean squared error (MSE), which is the squared Euclidean distance divided by the number of pixels, to calculate the distance. This is because we are estimating the relative overall brightness difference. For the following layers, which focus more on image patterns, we tried the Structural Similarity Index (SSIM) and the Image Euclidean Distance (IMED). Unlike the plain Euclidean distance, IMED weights the product of every pair of pixel differences by a Gaussian function of the spatial distance between the two pixels (Wang et al., 2005).
As shown in Figure 8, these two distances take the relative positions of pixels within the matrix (image) into account when calculating the value differences between pixels. Considering pixel positions alleviates the problem of two images having similar patterns that are shifted by a few pixels relative to each other.
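To make the difference between the two measures concrete, here is a small NumPy sketch of MSE and of IMED as defined by Wang et al. (2005): d²(x, y) = Σ_i Σ_j g_ij (x_i − y_i)(x_j − y_j), with g_ij = exp(−|P_i − P_j|² / (2σ²)) / (2πσ²), where P_i is the position of pixel i. The function names, the choice σ = 1, and the 8 x 8 toy images are assumptions for illustration only.

```python
import numpy as np

def mse_distance(a, b):
    """Mean squared error between two equally sized images."""
    return float(np.mean((a - b) ** 2))

def imed_distance(a, b, sigma=1.0):
    """Image Euclidean Distance: pixel differences coupled by a spatial Gaussian."""
    h, w = a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)       # (N, 2) pixel positions
    sq_dist = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)  # (N, N)
    G = np.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    d = (a - b).ravel()
    return float(np.sqrt(d @ G @ d))

def single_dot(row, col, size=8):
    """Toy image with a single bright pixel."""
    img = np.zeros((size, size))
    img[row, col] = 1.0
    return img

base = single_dot(3, 3)
near = single_dot(3, 4)   # same pattern shifted by one pixel
far = single_dot(3, 7)    # same pattern shifted by four pixels

print(mse_distance(base, near), mse_distance(base, far))
print(imed_distance(base, near), imed_distance(base, far))
```

MSE treats the small shift and the large shift as equally different, whereas IMED rates the one-pixel shift as closer, which is the shift tolerance that motivates using it for pattern-oriented layers.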
DBSCAN was used as the clustering method. Previous studies used k-means to cluster CNN kernels. We found that k-means always distributes the kernels roughly evenly across clusters, which is not suitable for this case. This is because most of our convolutional layers have only 18 filters and we still need to keep more than half of them, which means that each cluster contains only a few filters. Two measures were applied to compute the distance between every pair of filters, in matrix form, for clustering: MSE for the first layer and IMED for the second layer. A threshold was set to select the filters that can be grouped together.
The clustering results for the first and second convolutional layers are shown below in Figure 9. The clustering result for the first-layer filters is consistent with the filters’ maximized-input visualization, indicating that the MSE method worked well for brightness. For the second layer, IMED and SSIM produced structurally similar clusters. The IMED result was used for the final filter compression.
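The threshold-based grouping can be illustrated without any clustering library. When DBSCAN is run with a minimum cluster size of one, every point is a core point, so the clusters are simply the connected components of the graph that links filters closer than the threshold. The sketch below implements that special case on a precomputed distance matrix; it should correspond to scikit-learn's `DBSCAN` with `metric="precomputed"` and `min_samples=1`, but the actual settings used in this project are not stated, and the function name and toy data are assumptions.

```python
import numpy as np

def threshold_clusters(dist, eps):
    """Group items whose pairwise distance is <= eps, transitively.

    dist: (n, n) symmetric matrix of precomputed distances (e.g. MSE or IMED).
    Returns an integer label per item; isolated items form their own cluster.
    """
    n = dist.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue                      # already assigned to a cluster
        labels[start] = current
        stack = [start]
        while stack:                      # flood-fill over the eps-neighbourhood graph
            i = stack.pop()
            for j in np.flatnonzero(dist[i] <= eps):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy "filters" summarised by a single brightness value each.
brightness = np.array([0.0, 0.1, 0.2, 5.0, 5.05, 10.0])
dist = np.abs(brightness[:, None] - brightness[None, :])
labels = threshold_clusters(dist, eps=0.15)
print(labels)
```

Unlike k-means, the number of clusters is not fixed in advance and cluster sizes can be very uneven, so a filter with no close neighbour is kept as its own singleton cluster.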
Figure 9: The upper panel shows the clustering result (matrix on the upper right corner) for the first convolutional layer. The lower panel shows the clustering result (matrix on the upper-lower corner) for the second convolutional layer.
3.3 Merging Filters
For each cluster of filters, we merge the members into a single filter, as they process similar patterns or brightness. Every filter is a three-dimensional tensor. For the convolutional layers that need to be shrunk, the filters were simply averaged within each cluster: we take the average of the weights of the filters in a cluster and build a new filter representing the pattern or brightness of that group. However, doing this also reduces the number of output channels and causes a shape mismatch with the next layer. Therefore, for the following convolutional layer, the weights in each filter were summed across the third (input-channel) dimension according to how the previous layer was clustered. Figure 10 shows a simple demonstration of how the filters of a convolutional layer in a color model are merged based on known clustering results.
In this pretrained DenseNet, the input of each dense layer is the concatenation of the outputs of the previous layers in the block. Therefore, when the filters in the first layer are merged, the corresponding weights in all following layers of the current dense block also need to be summed. In theory, this pipeline preserves most of the image pattern information after the weights are updated, but there is another important layer in DenseNet: batch normalization. These layers can be viewed as brightness and contrast adjustments. However, these features are not represented in the gradient ascent results, and they are hard to cluster. Therefore, these parameters were re-estimated by training only the batch normalization layers for one epoch, so that they learn the proper contrast and brightness. The validation results on the CIFAR-10 dataset were then compared between this modified model and the baseline model (after changing the input images to grayscale).
All of our training and testing was performed on the CIFAR-10 dataset with DenseNet. We made two versions of the clustered model: one had only its first layer clustered, and the other had both its first and second layers clustered. The original weights of these two models come from a DenseNet pretrained on color images. Its accuracy is 91.16% on color images and 83.20% on grayscale images. These two accuracies serve as our baselines.
The accuracy comparisons are shown in Table 3. Merging filters in the first layer of the grayscale model gives an accuracy of 85.43%. When both the 1st and 2nd layers are clustered, the accuracy is 84.83%. The accuracies of the clustered grayscale models are lower than the first baseline, 91.16%. This is understandable, as some information is lost when going from color images to grayscale images. A positive outcome is that the clustered and merged models outperformed the second baseline, 83.20%, by 2.23 and 1.63 percentage points respectively, while the number of parameters was reduced by about 5%.
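The averaging and summing steps described above can be sketched as follows. The weight layout `(kh, kw, in, out)`, the function names, and the toy shapes are assumptions; biases and batch normalization are left out. The example uses 1 x 1 kernels so that each layer reduces to a matrix product, and makes two filters of the first layer identical to show that merging identical filters leaves the network output unchanged.

```python
import numpy as np

def merge_filters(w, labels):
    """Average the output filters of w (kh, kw, c_in, c_out) within each cluster."""
    n_clusters = labels.max() + 1
    return np.stack([w[..., labels == c].mean(axis=-1) for c in range(n_clusters)], axis=-1)

def merge_next_inputs(w_next, labels):
    """Sum the input channels of the following layer according to the same clusters."""
    n_clusters = labels.max() + 1
    return np.stack([w_next[:, :, labels == c, :].sum(axis=2) for c in range(n_clusters)], axis=2)

labels = np.array([0, 0, 1, 2])          # filters 0 and 1 fall into the same cluster
rng = np.random.default_rng(0)
w1 = rng.normal(size=(1, 1, 3, 4))       # layer to shrink: 3 inputs, 4 filters
w1[..., 1] = w1[..., 0]                  # make the clustered filters identical
w2 = rng.normal(size=(1, 1, 4, 5))       # following layer: 4 inputs, 5 filters

m1 = merge_filters(w1, labels)           # now 3 filters
m2 = merge_next_inputs(w2, labels)       # now 3 input channels

x = rng.normal(size=3)                   # one pixel with 3 input channels
relu = lambda v: np.maximum(v, 0.0)
y_full = relu(x @ w1[0, 0]) @ w2[0, 0]
y_merged = relu(x @ m1[0, 0]) @ m2[0, 0]
print(np.allclose(y_full, y_merged))
```

When the clustered filters are only similar rather than identical, the merged output is an approximation, which is the source of the information loss discussed in this article.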
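The extra bookkeeping for dense blocks can be sketched like this. In a dense block, a later layer receives the concatenation of several earlier outputs, so only the slice of its input channels that comes from the merged layer has to be summed by cluster, while the other slices are kept untouched. The channel counts, the `start` offset, and the function name below are hypothetical and only illustrate the indexing.

```python
import numpy as np

def merge_concat_segment(w_consumer, labels, start):
    """Sum by cluster the input channels [start, start + len(labels)) of a later layer.

    w_consumer: (kh, kw, c_in, c_out) weights of a layer whose input is a
    concatenation; channels outside the segment are copied unchanged.
    """
    n = len(labels)
    n_clusters = labels.max() + 1
    segment = w_consumer[:, :, start:start + n, :]
    merged = np.stack([segment[:, :, labels == c, :].sum(axis=2) for c in range(n_clusters)], axis=2)
    return np.concatenate(
        [w_consumer[:, :, :start, :], merged, w_consumer[:, :, start + n:, :]], axis=2
    )

labels = np.array([0, 1, 1, 2, 2, 2])    # hypothetical clustering of a 6-filter layer
rng = np.random.default_rng(0)
# Hypothetical later layer: 4 block-input channels, 6 channels from the
# merged layer, then 6 channels from another layer (16 inputs in total).
w_consumer = rng.normal(size=(3, 3, 16, 18))
w_new = merge_concat_segment(w_consumer, labels, start=4)
print(w_new.shape)
```

The same call would be repeated for every later layer in the block that consumes the merged layer's output, each with its own `start` offset.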
Table 3. Test accuracy of CIFAR-10 using DenseNet after 190 epochs
5 Limitations and Future Works
The first limitation we identified is the information loss during the clustering and merging processes. As indicated by the results, removing color information and merging the filters in charge of similar patterns reduces classification accuracy relative to the color model. In this project, the model with layers 1 and 2 merged kept a relatively high classification accuracy. For convolutional layers at a higher (deeper) level, the information loss will increase, as the filters focus on more specific patterns and fewer filters can be clustered. Secondly, the gradient ascent used to compute the receptive fields renders different results on each run, which introduces uncertainty into the clustering result. According to our tests, however, these differences in the receptive fields have only a minor influence on the clustering result.
We also identified some aspects that future work can focus on. First, the filters of deeper convolutional layers should also be clustered and merged. More experiments would give a more comprehensive view of how the model performs as we remove all of its redundant functions in charge of color information. Second, we can train both the convolutional layers and the batch normalization layers for more epochs, to investigate whether the modified model can achieve more accurate classification. Such a training process requires significantly less time and fewer computational resources than training a model with grayscale input from scratch. Lastly, future studies could combine different clustering methods to see whether the resulting clusters lead to less information loss.
Turn color images to grayscale:
 Luma (video). (2019, July 3). Retrieved from https://en.wikipedia.org/wiki/Luma_(video)
Clustering method used for filter clustering:
 2.3. Clustering. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/clustering.html#dbscan
Previous research on clustering CNN filter:
 Son, S., Nah, S., & Lee, K. M. (2018). Clustering Convolutional Kernels to Compress Deep Neural Networks. Computer Vision - ECCV 2018, Lecture Notes in Computer Science, 225–240. doi: 10.1007/978-3-030-01237-3_14
Code reference in image clustering:
 Llvll. (2016, January 19). llvll/imgcluster. Retrieved from https://github.com/llvll/imgcluster
Used for calculating image euclidean distance:
 Wang, L., Zhang, Y., & Feng, J. (2005). On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1334–1339. doi: 10.1109/tpami.2005.165
 Huang, G., Liu, Z., Maaten, L. V. D., & Weinberger, K. Q. (2017). Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi: 10.1109/cvpr.2017.243
 The CIFAR-10 dataset. (n.d.). Retrieved from https://www.cs.toronto.edu/~kriz/cifar.html