How does deep learning determine the best depth?

Determining the optimal depth reduces computational cost while further improving accuracy. For the problem of selecting the depth of a deep belief network, this article analyzes the shortcomings of choosing the optimal depth by setting a threshold. From the perspective of information theory, it is verified that the information entropy converges once the training of each Restricted Boltzmann Machine (RBM) layer reaches a steady state, and the converged information entropy is used as the criterion for judging the optimal number of layers. Experiments on handwritten digit recognition confirm that this can serve as a criterion for judging the optimal number of layers.

*Fund Project: Supported by the Natural Science Foundation of Fujian Province (2014J01234) and a Funded Project of the Fujian Provincial Department of Education (JA15061)

An artificial neural network abstracts the human brain's neural network from the perspective of information processing, establishes a simple model, and forms different networks according to different connection schemes. Prior to 2006, most learning methods for classification and regression were shallow models with a single hidden layer; their limitation was a restricted ability to represent complex functions given finite samples and computational units. In 2006, the Deep Belief Network (DBN) proposed by Professor Hinton of the University of Toronto set off another wave of research into artificial neural networks. Traditional shallow neural networks randomly initialize the weights in the network and easily converge to a local minimum. In response to this problem, Hinton proposed first initializing the weights with unsupervised training and then determining them by supervised backward fine-tuning, achieving better results. In addition, the Recurrent Neural Network (RNN) proposed by Mikolov is mainly used for the prediction of sequence data and has a certain memory effect. Later DBN research produced further variants, such as Convolutional Deep Belief Networks (CDBN).

At present, deep learning has achieved great success in the fields of speech recognition and computer vision.

However, research on deep learning has only begun in recent years, and modeling is one of its key issues: how to build a suitable depth model for different applications is a very challenging problem. DBNs still rely on empirical values to choose the number of layers and the number of nodes per layer. It has been found that system performance peaks at a certain number of layers; increasing the depth beyond that point does not improve performance but instead makes training take too long, increasing the computational cost.

In recent years there has been some preliminary progress in determining the number of DBN layers. One approach uses the central limit theorem to prove that, after training of a Restricted Boltzmann Machine (RBM) reaches a steady state, the elements of the corresponding weight matrix obey a normal distribution; as the number of layers increases, the weight matrix tends ever closer to normality, so the layer whose weights are closest to the normal distribution is taken as the basis for determining the depth, and a normal-distribution satisfaction rate is computed to select the appropriate number of layers. Pan Guangyuan et al. set a threshold on the reconstruction error to determine the number of layers: when the reconstruction error has not reached the threshold, a layer is added. Although the reconstruction error reflects, to a certain extent, the likelihood the RBM assigns to the training data, it is not completely reliable. Evidently the current methods all judge by setting a threshold, which can fail when the threshold is poorly chosen. In view of this, this paper proposes using the information entropy of the hidden layer, measured after RBM training reaches a steady state, to determine the optimal number of layers. Each added RBM layer increases the information entropy; when the information entropy no longer increases, that layer is selected as the optimal number of layers.

1 Determination of the number of deep belief network layers

In 2006, Hinton et al. proposed the deep belief network, formed by stacking several RBMs. An RBM is a two-layer model consisting of a visible layer and a hidden layer. To train an RBM, the visible layer is first initialized randomly, then Gibbs sampling is performed between the visible and hidden layers: the conditional probability distribution P(h|v) gives the hidden layer, P(v|h) then recomputes the visible layer, and the process is repeated until the visible and hidden layers reach equilibrium. The goal of training the RBM is to make the distribution of the computed visible layer fit the distribution of the initial visible layer. With the training data as the initial state, the difference between the sample obtained by Gibbs sampling from the RBM's distribution and the original data is the reconstruction error.
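To make the sampling procedure concrete, here is a minimal numpy sketch of one Gibbs step and the resulting reconstruction error. The function names, the Bernoulli sampling of binary units, and the squared-error form are illustrative assumptions, not details given in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v0, W, b_h, b_v, rng):
    """One Gibbs step v -> h -> v' for a binary RBM.

    v0 : (n_visible,) binary input; W : (n_hidden, n_visible) weights;
    b_h, b_v : hidden / visible bias vectors.
    """
    p_h = sigmoid(W @ v0 + b_h)                        # P(h = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)    # sample the hidden layer
    p_v = sigmoid(W.T @ h + b_v)                       # P(v = 1 | h)
    v1 = (rng.random(p_v.shape) < p_v).astype(float)   # reconstruct the visible layer
    return h, p_v, v1

def reconstruction_error(v0, p_v):
    # Squared difference between the original data and its reconstruction
    return float(np.sum((v0 - p_v) ** 2))
```

Iterating `gibbs_step` drives the chain toward the RBM's equilibrium distribution; the one-step reconstruction error is the quantity the threshold methods discussed next rely on.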

It has been shown that the training precision of an RBM increases with depth, and that the reconstruction error is positively correlated with the network energy. A threshold is therefore set on the reconstruction error: if the error has not reached the threshold, a layer is added, and the depth at which the threshold is reached is taken as the optimal number of layers. The final experiment shows, however, that although the fourth layer is selected as the optimal depth, the reconstruction error is still decreasing at the fifth and sixth layers. If the threshold is chosen poorly, the reconstruction error may satisfy the threshold condition while the structure obtained from the selected depth still fails to achieve good results.
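A sketch of this threshold rule, assuming a hypothetical `train_layer` helper that trains one RBM on the current data and returns the hidden activations together with that layer's reconstruction error (such a helper is not part of the original method's text):

```python
def select_depth_by_threshold(data, train_layer, threshold, max_layers=10):
    """Threshold-based depth selection, as critiqued above.

    `train_layer` is an assumed callable: it trains one RBM on `data` and
    returns (hidden_activations, reconstruction_error).
    """
    for depth in range(1, max_layers + 1):
        data, recon_err = train_layer(data)
        if recon_err <= threshold:
            # Stops at the first depth satisfying the threshold, even if the
            # error would keep falling at deeper layers (e.g. layers 5 and 6).
            return depth
    return max_layers
```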

Therefore, this paper proposes using the information entropy of the hidden layer to determine the optimal number of layers. In information theory, the information entropy represents the average amount of information a source provides after its output is observed, and the average uncertainty about the source before the output. Information entropy can also be viewed as a measure of the order of a system: the more ordered a system, the lower its information entropy, and the more disordered, the higher its entropy. The goal of training an RBM is to make the system's energy function smaller and the system more ordered, so after RBM training the information entropy converges to a smaller value.
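A small numerical illustration of this relationship between order and entropy (the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits: highly ordered, low entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: maximally disordered
```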

Suppose the input matrix is $V = (v_1, v_2, v_3, \ldots, v_i)$, and the output matrix after RBM training is $Y = (y_1, y_2, y_3, \ldots, y_j)$. From the trained RBM model, the values of the hidden nodes are obtained from the visible nodes, i.e.:

$$P(Y) = S(WV + B) \qquad (1)$$

where $W$ is the weight matrix, $B$ is the offset matrix, and $S(x)$ is the activation function, for which the Sigmoid function is generally chosen, namely:

$$S(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

The information entropy is computed as:

$$H(Y) = -\sum_{j} p(y_j) \log p(y_j) \qquad (3)$$
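The paper does not spell out how $p(y_j)$ is estimated from the hidden activations. The sketch below averages each unit's firing probability over a batch of inputs and treats every hidden unit as a binary source, which is one plausible reading of equations (1) through (3); all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    # Eq. (2): S(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def hidden_layer_entropy(V, W, B):
    """Entropy (in bits) of the hidden layer for a batch of inputs.

    V : (n_samples, n_visible) input matrix
    W : (n_hidden, n_visible) weight matrix
    B : (n_hidden,) hidden bias vector
    """
    # Eq. (1): P(Y) = S(WV + B), the firing probability of each hidden unit
    P = sigmoid(V @ W.T + B)                        # shape (n_samples, n_hidden)
    p = np.clip(P.mean(axis=0), 1e-10, 1 - 1e-10)   # average over the batch
    # Eq. (3) applied per binary hidden unit (one plausible reading):
    return float(-np.sum(p * np.log2(p) + (1 - p) * np.log2(1 - p)))
```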

According to the contrastive divergence algorithm proposed by Hinton [13], the weights and offsets are updated as follows:

$$w_{i,j} \leftarrow w_{i,j} + \left[ P(h_i = 1 \mid V^{(0)})\, v_j^{(0)} - P(h_i = 1 \mid V^{(k)})\, v_j^{(k)} \right] \qquad (4)$$

$$b_i \leftarrow b_i + \left[ P(h_i = 1 \mid V^{(0)}) - P(h_i = 1 \mid V^{(k)}) \right] \qquad (5)$$

When RBM training reaches its final state, the weights $w_{i,j}$ and offsets $b_i$ gradually converge. Since $v$ is the input data and therefore fixed, $p(y_i)$ also gradually converges, and thus the information entropy $H(Y)$ converges to a smaller value.

After one layer is trained, its hidden layer is used as the visible layer of the second layer, and training of the second-layer RBM begins. By the other physical meaning of information entropy, the average amount of information obtained once uncertainty is eliminated, a larger information entropy means more information is obtained, i.e., more feature information is extracted by the hidden layer. Therefore, when the information entropy no longer increases, the amount of information represented no longer increases. Regarding each layer's RBM as a source, the information entropy of the last layer after RBM convergence should be larger than that of the other layers, so the amount of information available to the supervised learning stage will be greatest. Thus, when the information entropy no longer increases, that layer is selected as the optimal number of layers.
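Putting the pieces together, here is a minimal end-to-end sketch of the proposed criterion: a CD-1 trainer following equations (4) and (5) (a learning rate is added as a practical detail the equations omit), the hidden-layer entropy in the per-unit binary reading of equation (3), and a stacking loop that stops as soon as the entropy no longer increases. Layer sizes, epochs, and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Minimal CD-1 trainer for one RBM layer, following eqs. (4)-(5)."""
    n_vis = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_hidden, n_vis))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        for v0 in data:
            p_h0 = sigmoid(W @ v0 + b_h)                  # P(h=1 | V^(0))
            h0 = (rng.random(n_hidden) < p_h0).astype(float)
            p_v1 = sigmoid(W.T @ h0 + b_v)                # reconstruction V^(1)
            p_h1 = sigmoid(W @ p_v1 + b_h)                # P(h=1 | V^(1))
            W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, p_v1))  # eq. (4)
            b_h += lr * (p_h0 - p_h1)                               # eq. (5)
            b_v += lr * (v0 - p_v1)
    return W, b_h

def hidden_entropy(data, W, b_h):
    """Hidden-layer entropy in bits (per-unit binary reading of eq. 3)."""
    p = np.clip(sigmoid(data @ W.T + b_h).mean(axis=0), 1e-10, 1 - 1e-10)
    return float(-np.sum(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def select_depth_by_entropy(data, n_hidden=64, max_layers=8):
    """Stack RBMs until the hidden-layer entropy stops increasing."""
    x, prev_H, layers = data, -np.inf, []
    for depth in range(1, max_layers + 1):
        W, b_h = train_rbm(x, n_hidden)
        H = hidden_entropy(x, W, b_h)
        if H <= prev_H:
            return depth - 1, layers   # the previous depth is taken as optimal
        layers.append((W, b_h))
        prev_H = H
        x = sigmoid(x @ W.T + b_h)     # hidden activations feed the next layer
    return max_layers, layers
```

The stopping test compares successive entropies directly, so no threshold needs to be hand-tuned, which is the advantage argued for above.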
