Effects of Sparse Initialization in Deep Belief Networks
Keywords: Sparse Initialization, Deep Belief Networks, Noisy Rectified Linear Units
Abstract: Deep neural networks are often trained in two phases: first, the hidden layers are pretrained in an unsupervised manner, and then the network is fine-tuned with error backpropagation. Pretraining is often carried out using Deep Belief Networks (DBNs), with initial weights set to small random values. However, recent results established that well-designed initialization schemes, e.g. Sparse Initialization (SI), can greatly improve the performance of networks that do not use pretraining. An interesting question arising from these results is whether such initialization techniques would also improve pretrained networks. To shed light on this question, in this work we evaluate SI in DBNs that are used to pretrain discriminative networks. The motivation behind this research is our observation that SI has an impact on the features learned by a DBN during pretraining. Our results demonstrate that this improves network performance: when pretraining starts from sparsely initialized weight matrices, networks achieve lower classification error after fine-tuning.