Last week I was delighted to attend the "Systematic approaches to deep learning methods for audio" workshop, organized by the Erwin Schrödinger Institute in Vienna, where I presented ongoing work on the analysis of audio-visual correspondences.
The idea of the workshop was to bring together mathematicians and machine learning researchers working on audio- and deep-learning-related problems. The following topics were proposed for discussion:
- Mathematical understanding of deep learning
- Introspection in deep learning
- End-to-end learning in MIR
- Signal representations in deep learners vs. adaptive signal transforms
- Scattering transforms and signal representations in deep learners
I would like to highlight some of the presented work and share my notes and impressions.
The first big question we discussed could be recapped as follows: "Can we formalize structure that already exists in the data and impose this knowledge on DNNs?" Imposing domain knowledge on networks seems to be under great and active discussion right now. Some examples:
- Irene Waldspurger presented her work entitled "Inversion of the wavelet transform modulus." Her talk was about audio reconstruction from scalograms, specifically from Cauchy wavelets, as well as about scattering transforms and the possibility of using them as an initialization for CNNs.
- Fabio Anselmi gave an excellent talk on "Invariant and selective data representations with applications to Deep learning."
- Other notable presentations were given by Joakim Andén and Vincent Lostanlen. They discussed the use of the joint time-frequency scattering transform for CNNs and scattering on the pitch spiral. They proposed a hierarchical CNN in which the filters of the first few layers are fixed and given by multiple scattering transforms. The joint scattering transform has been shown to be time-shift invariant, frequency-transposition invariant, and robust to time-warping deformations.
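The core mechanism behind that time-shift invariance is simple: band-pass filtering, complex modulus, and local averaging. Here is a crude numpy illustration of a first-order scattering-like feature (not the joint scattering used in their work; the Gaussian band-pass bank is a made-up stand-in for a proper wavelet filter bank):

```python
import numpy as np

def gabor_bank(n, n_filters=8, sigma=0.02):
    """Fixed bank of approximately analytic band-pass filters,
    defined directly in the Fourier domain."""
    freqs = np.fft.fftfreq(n)
    centers = [k / (2.0 * (n_filters + 1)) for k in range(1, n_filters + 1)]
    return np.array([np.exp(-((freqs - c) ** 2) / (2 * sigma ** 2))
                     for c in centers])

def first_order_scattering(x, bank, pool=32):
    """|x * psi| followed by local averaging: the modulus discards phase,
    and pooling makes the result stable to small time shifts."""
    X = np.fft.fft(x)
    out = []
    for f in bank:
        env = np.abs(np.fft.ifft(X * f))          # modulus of band-pass output
        env = env[: len(env) // pool * pool]
        out.append(env.reshape(-1, pool).mean(axis=1))  # average pooling
    return np.array(out)

# a periodic test signal and a time-shifted copy of it
n = 1024
t = np.arange(n)
x = np.sin(2 * np.pi * 51 * t / n)
bank = gabor_bank(n)
s1 = first_order_scattering(x, bank)
s2 = first_order_scattering(np.roll(x, 5), bank)
print(np.max(np.abs(s1 - s2)) < 0.05)  # True: features barely move
```

The filters themselves are fixed, which is exactly the appeal for the first layers of a CNN: no parameters to learn, and the invariances come for free.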
To my great joy, the topic of probabilistic networks and deriving optimal architectures came up in the discussion several times:
- Philipp Grohs discussed the variety of open theoretical questions in his talk "Deep Learning as a Mathematician" (you can find the slides here)
- Antoine Deleforge presented the work "Reversed Mixture-of-Experts Networks for High- to Low-Dimensional Regression," about estimating low-dimensional quantities from high-dimensional data (for tasks such as sound source estimation or human pose estimation) by building inverse regression networks that combine a mixture of experts with a final gating network.
- Karen Ullrich gave a talk on Bayesian networks, with applications to sparsification and overconfidence evaluation.
- The presentation entitled "Bayesian meter tracking on learned signal representations," given by Andre Holzapfel, covered probabilistic post-processing of results obtained with CNNs.
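For readers unfamiliar with the mixture-of-experts idea that the inverse-regression work builds on, here is a generic forward pass: several simple experts each produce an estimate, and a softmax gate blends them. This is only an illustrative sketch with random made-up weights, not the model from the talk:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# K linear "experts" map a high-dimensional observation y to a
# low-dimensional target x; a gating network weighs their predictions.
rng = np.random.default_rng(1)
K, D_in, D_out = 3, 16, 2
experts_W = rng.normal(size=(K, D_out, D_in))
gate_W = rng.normal(size=(K, D_in))

def moe_predict(y):
    gates = softmax(gate_W @ y)                    # (K,) mixture weights
    preds = np.einsum('kod,d->ko', experts_W, y)   # each expert's estimate
    return gates @ preds                           # convex combination

y = rng.normal(size=D_in)
x_hat = moe_predict(y)
print(x_hat.shape)  # (2,): a low-dimensional estimate from 16-dim input
```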
Another topic that resonates well with my work is the understanding and interpretability of learned data representations and networks. It came up in many of the talks in one form or another, but the following lectures were devoted to it entirely:
- Grégoire Montavon, in his talk "Explaining the Predictions of Deep Neural Networks," presented several methods for explaining network predictions, such as deep Taylor decomposition and layer-wise relevance propagation (LRP). It's worth mentioning that they have a great online demo (http://heatmapping.org/). Also, they are organizing a workshop at NIPS on the topic of interpretability: http://www.interpretable-ml.org/nips2017workshop.
- Saumitra Mishra discussed the importance of local interpretability of network predictions for music. His work focuses on extending the LIME algorithm to music content analysis. The code of their SoundLIME system is available online (https://code.soundsoftware.ac.uk/projects/SoundLIME). He mentioned two workshops organized by QMUL in the coming months: the more general Machine Learning for Sound and Music Information Retrieval at NIPS, and HORSE2017.
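To give a feel for how LRP-style explanation works, here is a minimal numpy sketch of the epsilon rule for dense layers: output relevance is redistributed to inputs in proportion to each connection's contribution, layer by layer. This is an illustration with made-up weights, not the authors' implementation:

```python
import numpy as np

def lrp_linear(w, a, r_out, eps=1e-6):
    # w: (in, out) weights; a: (in,) activations entering the layer;
    # r_out: (out,) relevance arriving at the layer's outputs.
    z = a[:, None] * w                            # per-connection contributions
    zs = z.sum(axis=0)                            # output pre-activations
    zs = zs + eps * np.where(zs >= 0, 1.0, -1.0)  # epsilon stabilizer
    return (z / zs * r_out).sum(axis=1)           # relevance of each input

# a tiny two-layer ReLU network with fixed weights
x  = np.array([1.0, -0.5, 2.0])
w1 = np.array([[ 0.5, -0.2,  0.1,  0.3],
               [ 0.4,  0.6, -0.3,  0.2],
               [-0.1,  0.2,  0.5, -0.4]])
w2 = np.array([[ 0.2, -0.5],
               [ 0.3,  0.1],
               [-0.4,  0.6],
               [ 0.1,  0.2]])
a1  = np.maximum(0.0, x @ w1)
out = a1 @ w2

# propagate the output scores back to the input features
r1 = lrp_linear(w2, a1, out)
r0 = lrp_linear(w1, x, r1)

# epsilon-LRP approximately conserves total relevance layer to layer
print(abs(r0.sum() - out.sum()) < 1e-4)  # True
```

The resulting `r0` assigns each input feature a share of the output score, which is what the heatmaps in the demo above visualize for pixels (or, in the audio case, time-frequency bins).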
Last but not least, we had a great discussion on multimodality. In particular:
- Matthias Dorfer presented his work on audio-visual score following and retrieval;
- Oriol Nieto presented several enhancements for the cold-start problem in music recommendation (work led by Sergio Oramas in collaboration with Pandora);
- Hendrik Koops shared their experience with user-oriented chord estimation;
- and I talked about my research on multimodal musical instrument recognition.
Many other presentations were left out of the scope of this short review, but they were nonetheless very interesting and of high scientific quality.
I would like to thank Monika and Arthur for organizing such a great event and inviting me, as well as my supervisors Emilia and Gloria for giving me an opportunity to participate. It was, undoubtedly, useful and extremely educational.
In the photo: the longest blackboard I've ever seen in my life :)