At the end of August, I participated in a summer school on Bayesian methods for deep learning.
I think it is a good reason to finally start writing for the blog :)
The Deep|Bayes school was organized by the Bayesian Methods Research Group of the Higher School of Economics, Moscow, Russia. The program covered a wide variety of subjects in deep learning and Bayesian statistics, along with practical sessions on topics such as VAEs, GANs, Gaussian processes and DL models with attention.
Some useful takeaway messages:
- One can add noise to the network weights and use variational dropout for:
  - reducing the variance of stochastic gradients
  - sparsifying the network by up to 95%
- Even good DNN models can suffer when labels are assigned randomly: they will fit the training data but not generalize. It seems that Bayesian NNs might be useful in this case (at least, they will refuse to train).
- Dropout is a standard technique for ensembling! In Lasagne, there is a simple parameter deterministic=False which keeps dropout active at prediction time, so you can get different predictions from the same network and average them:
      T.mean([lasagne.layers.get_output(net, deterministic=False) for i in range(10)], axis=0)
  When performing Bayesian model selection, the dropout probability p can be selected from the data.
- Any method which doesn't overfit is a wrong method. We need to gradually make it more and more complicated until it starts to overfit and then think of how to regularise it. Let’s overfit!
- Attention in deep models improves interpretability and provides better results. Attention may be interpreted as a latent variable. Attention is all you need :) 
- With the Bayesian framework, we can measure the uncertainty of the model and even distinguish between normal and adversarial examples.
- Almost any prior can be added into the model as a latent variable. Unfortunately, only a few people know how to do it (I'm not among them).
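To make the dropout-ensembling and uncertainty points above a bit more concrete, here is a minimal framework-free sketch in plain Python. It is not the Lasagne code from the school: the tiny network, its random untrained weights, and the helper names `make_network`, `forward` and `mc_dropout_predict` are all assumptions made up for illustration. The idea is simply to keep dropout switched on at prediction time, run several stochastic forward passes, and look at both the mean and the spread of the outputs.

```python
import math
import random
import statistics


def make_network(n_in, n_hidden, seed=0):
    """Build a toy one-hidden-layer network with fixed random weights (hypothetical, untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]
    return w1, w2


def forward(x, net, p_drop=0.5, deterministic=True, rng=random):
    """One forward pass; dropout on the hidden layer is active only if deterministic=False."""
    w1, w2 = net
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if not deterministic:
        keep = 1.0 - p_drop
        # Inverted dropout: drop each unit with probability p_drop, rescale the survivors.
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    logit = sum(w * h for w, h in zip(w2, hidden))
    # Sigmoid written via tanh so it cannot overflow for large |logit|.
    return 0.5 * (1.0 + math.tanh(0.5 * logit))


def mc_dropout_predict(x, net, n_samples=10, p_drop=0.5, seed=None):
    """Average several stochastic passes; return (mean prediction, standard deviation)."""
    rng = random.Random(seed)
    samples = [
        forward(x, net, p_drop=p_drop, deterministic=False, rng=rng)
        for _ in range(n_samples)
    ]
    return statistics.mean(samples), statistics.pstdev(samples)


net = make_network(n_in=3, n_hidden=16, seed=0)
x = [0.5, -1.2, 0.3]
mean, spread = mc_dropout_predict(x, net, n_samples=100, seed=1)
print("ensemble mean:", mean, "spread:", spread)
```

The spread across passes is only a crude proxy for model uncertainty, but it gives a feeling for how one might flag inputs on which the "ensemble members" disagree.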
Apart from the intense scientific program, it was very nice to meet people from industry and learn about their ML/DL use cases. Of course, there were people from NLP and computer vision, but it was a big surprise for me to learn that security companies have some ill-formalized and non-routine ML tasks as well.
I would like to thank the organizers for such a great opportunity to learn and refresh many topics in deep learning and Bayesian statistics, as well as for the chance to socialize with others. The organizers plan to hold the next edition of the school in English, so I strongly encourage everybody interested to participate.
Well, it’s also worth mentioning that the organizers have a great sense of humor: it’s me with the reincarnation of Thomas Bayes, Dmitry Vetrov. And, yeah, it was quite deep :)
 Kingma, Diederik P., Tim Salimans, and Max Welling. "Variational dropout and the local reparameterization trick." Advances in Neural Information Processing Systems. 2015.
 Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. "Variational Dropout Sparsifies Deep Neural Networks." International Conference on Machine Learning (ICML 2017). 2017.
 Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.03530 (2016).
 Vaswani, Ashish, et al. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2017).
 Li, Yingzhen, and Yarin Gal. "Dropout Inference in Bayesian Neural Networks with Alpha-divergences." International Conference on Machine Learning (ICML 2017). 2017.