Deep Learning

Brief summary of the course

The “Introduction to Deep Learning” course covers and discusses mainstream deep learning (DL) approaches, methods, and applications.

Course topics

12 lectures of 1.5 academic hours each + 2 home assignments of 22 hours of personal work each

 

Lecture 1. Training Multilayer Perceptrons.

 

Multilayer perceptrons (MLPs), why they are important: connections with modern architectures (ConvNets/RNNs), historical issues. Why the Kolmogorov-Cybenko universal approximation theorem works poorly in practice for shallow MLPs. General learning theory: frequentist and Bayesian approaches to learning, likelihood. Derivation of the L2-regularized mean squared error for curve fitting from the Bayesian approach. Multilayer perceptron: forward pass in scalar form. Activation functions: a brief review. Derivation of backpropagation. Derivatives for L2 regularization. MLP for classification: one-hot encoding, softmax, cross-entropy loss function. The numerical efficiency of backpropagation compared to the straightforward approach. Forward propagation/backpropagation in vector form.
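For illustration, a minimal NumPy sketch of the forward pass, softmax/cross-entropy loss, and backpropagation with L2 weight decay for a one-hidden-layer MLP; the data, layer sizes, and hyperparameters below are arbitrary choices, not taken from the lecture materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 features, 2 classes (one-hot targets).
X = rng.normal(size=(4, 3))
T = np.eye(2)[[0, 1, 1, 0]]

# One hidden layer with tanh activation, softmax output.
W1 = rng.normal(scale=0.5, size=(3, 5)); b1 = np.zeros(5)
W2 = rng.normal(scale=0.5, size=(5, 2)); b2 = np.zeros(2)
lr, lam = 0.1, 1e-3  # learning rate, L2 weight-decay coefficient

for step in range(100):
    # Forward pass (vector form).
    H = np.tanh(X @ W1 + b1)
    logits = H @ W2 + b2
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)          # softmax
    loss = -np.mean(np.sum(T * np.log(P), axis=1))

    # Backward pass: softmax + cross-entropy gives (P - T) at the output.
    dlogits = (P - T) / len(X)
    dW2 = H.T @ dlogits + lam * W2; db2 = dlogits.sum(axis=0)
    dH = dlogits @ W2.T
    dZ1 = dH * (1 - H**2)                      # tanh derivative
    dW1 = X.T @ dZ1 + lam * W1; db1 = dZ1.sum(axis=0)

    # Gradient descent step.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("final cross-entropy loss:", loss)
```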

 

Lecture 2. Deep Convolutional Neural Networks.

 

Convolutional neural networks, why they are important: modern achievements, historical and biological background. Cortical receptive fields: the classic experiments of Hubel & Wiesel, hierarchical organization. The breakthrough ILSVRC 2012 results. Comparison of MLP and CNN architectures. Convolutional network blocks: convolution, pooling, non-linear activation functions, fully-connected layers. Convolution in 1D/2D: detailed discussion, examples. Explanation of stride and zero-padding. The pooling layer. Implementation tricks: im2col/col2im, transformation from ConvNet to MLP and back. Overview and discussion of mainstream ConvNet architectures: AlexNet, Network in Network, VGG, ResNet, SqueezeNet.
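As an illustrative sketch (the image, kernel, stride, and padding values below are arbitrary, not from the lecture), a naive single-channel 2D convolution with stride and zero-padding, the operation that im2col later vectorizes:

```python
import numpy as np

def conv2d(x, k, stride=1, pad=0):
    """Naive single-channel 2D convolution (cross-correlation, as in ConvNets)."""
    x = np.pad(x, pad)                      # zero-padding on both axes
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1    # output height
    ow = (x.shape[1] - kw) // stride + 1    # output width
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0.], [0., -1.]])
print(conv2d(image, kernel, stride=2, pad=1).shape)  # -> (3, 3)
```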

 

Lecture 3. Regularization & Optimization for Deep Learning.

 

Regularization: the idea, historical issues. Model capacity control: the AIC criterion. Early stopping. Weight decay: L2/L1 variants, the maximum a posteriori (MAP) approach. Ensemble methods (“committee of experts”). Simpson’s paradox. Ensemble averaging and bagging (bootstrap aggregation). Dropout and its relationship to ensemble methods. Injecting noise into the output targets. Gradient optimization of the error function, using derivatives for gradient descent, three types of critical points. The problem of poor local minima. Introducing the Jacobian and Hessian matrices. Derivation of Newton’s method in vector form. Batch gradient descent, stochastic gradient descent, using momentum. Nesterov accelerated gradient. Adaptive learning-rate methods: AdaDelta, AdaGrad, RMSProp, Adam.
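For illustration (the hyperparameters are common defaults and the loss is a toy quadratic, not the lecture’s own example), the update rules for gradient descent with momentum and for Adam:

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * ||w||^2 (minimum at w = 0)."""
    return w

# Gradient descent with momentum.
w, v, lr, mu = np.array([2.0, -3.0]), np.zeros(2), 0.1, 0.9
for _ in range(100):
    v = mu * v - lr * grad(w)
    w = w + v

# Adam with common defaults (beta1=0.9, beta2=0.999, eps=1e-8).
w2, m, s = np.array([2.0, -3.0]), np.zeros(2), np.zeros(2)
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(w2)
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    s = b2 * s + (1 - b2) * g**2         # second-moment estimate
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)   # bias correction
    w2 = w2 - lr * m_hat / (np.sqrt(s_hat) + eps)

print("momentum:", w, " adam:", w2)      # both end up close to the minimum at 0
```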

 

Lecture 4. The Vanishing Gradient Effect. Recurrent Neural Networks.

 

Comparison of classic feedforward models (before 2006) and deep models. Backpropagation mechanics in vector form. Norms of vectors and matrices. Backpropagation mechanics as a product of Jacobians. The relationship between vanishing gradients and activation function saturation. Motivation to go deeper: overview of ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners 2011-2015. Smart initialization: derivation of the Xavier (Glorot) approach. Batch normalization. The orthogonal approach: orthogonal matrices, backpropagation flow revisited. Smart orthogonal initialization. Dynamic neural networks: feedforward plus time-delay bank, recurrent models, hybrid models. Operating modes of RNNs. Backpropagation Through Time (BPTT) for recurrent networks. The Simple Recurrent Network (SRN). SRN forward dynamics. Backpropagation Through Time for the SRN. The effect of different weight initializations for the SRN. Long Short-Term Memory (LSTM): the basic idea of avoiding the vanishing gradient effect using linear connections between states, a step-by-step overview. Back to ConvNets: Deep Residual Networks add linear (skip) connections to avoid the vanishing gradient effect.
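As a brief illustrative sketch (the depth, layer width, and batch size below are arbitrary assumptions), Xavier/Glorot initialization chooses the weight scale so that the activation variance stays roughly constant from layer to layer instead of vanishing:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier(fan_in, fan_out):
    """Glorot uniform initialization: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Push a batch of signals through 20 tanh layers and watch the activation scale.
x = rng.normal(size=(1000, 256))
for _ in range(20):
    x = np.tanh(x @ xavier(256, 256))
print("activation std after 20 layers:", x.std())   # stays on the order of 0.1-1,
                                                     # rather than collapsing toward 0
```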

 

Lecture 5. Learning Representations.

 

Learning representations (features). An example of end-to-end learning: PCA. Derivation of Principal Component Analysis. Unsupervised pretraining: a case study. Autoencoders: the basic idea, the “bottleneck” example. Undercomplete, sparse, and denoising autoencoders. Unsupervised greedy layer-wise pretraining. The hierarchy of trained representations for visual recognition. The “grandmother neuron” example. Visualizing CNN features: (guided) backpropagation. Pretraining as a “smart” initialization. Deep, big, simple neural nets (Schmidhuber, 2012): no pre-training, plain backpropagation + gradient descent. Case study: DeepFace (Facebook, 2014), supervised pre-training for face recognition. Databases for DeepFace pre-training. Transfer learning: CNNs as feature extractors. The Caffe Model Zoo. Learned representations for making art. Visualizing CNN features: gradient ascent. DeepDream: amplifying existing features.
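For illustration (the synthetic 2D data below is an assumption, not from the course), PCA computed via the SVD of the centered data. Projecting onto the first principal component and reconstructing is closely related to what a linear “bottleneck” autoencoder trained with squared error learns: its optimum spans the same principal subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data: most variance lies along one direction.
z = rng.normal(size=(500, 1))
X = np.hstack([z, 0.5 * z + 0.1 * rng.normal(size=(500, 1))])

# PCA: center the data, then take the right singular vectors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                          # rows are the principal directions
explained_var = S**2 / (len(X) - 1)

# "Bottleneck" of size 1: project onto the first component and reconstruct.
codes = Xc @ components[0]
X_rec = np.outer(codes, components[0]) + X.mean(axis=0)
print("explained variance per component:", explained_var)
print("mean reconstruction error:", np.mean((X - X_rec) ** 2))
```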

 

Lecture 6. Deep Learning for Natural Language Processing.

 

Sequential and tree-like data structures in NLP. Traditional feature vectors for texts: structural, lexical, syntactic, and other features. Training representations for texts: word2vec. Problems with one-hot-encoded raw word features. The Distributional Hypothesis. word2vec and its two basic neural network models: continuous bag-of-words (CBOW) and skip-gram. Multi-task learning for NLP. Neural Machine Translation (NMT). The sequence-to-sequence model, RNN encoder-decoder. Beam search. The attention mechanism for NMT. Attention beyond NMT: generating image descriptions (captioning).
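As an illustrative sketch (the toy corpus and window size are assumptions), constructing the (center word, context word) pairs on which the skip-gram model of word2vec is trained; CBOW uses the same windows but predicts the center word from its averaged context:

```python
from itertools import chain

corpus = ["the quick brown fox jumps over the lazy dog".split()]
vocab = sorted(set(chain.from_iterable(corpus)))
word2id = {w: i for i, w in enumerate(vocab)}

def skipgram_pairs(sentence, window=2):
    """Yield (center, context) id pairs used as skip-gram training targets."""
    ids = [word2id[w] for w in sentence]
    for i, center in enumerate(ids):
        for j in range(max(0, i - window), min(len(ids), i + window + 1)):
            if j != i:
                yield center, ids[j]

pairs = [p for sent in corpus for p in skipgram_pairs(sent)]
print(len(vocab), "vocabulary words,", len(pairs), "training pairs")
```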

 

Lecture 7. Neural Generative Models.

 

Discriminative and generative neural models; explicit vs. implicit density estimation; tractable vs. approximate density estimation. PixelCNNs/PixelRNNs. Variational autoencoders. Forward and reverse Kullback-Leibler divergence. Generative Adversarial Networks (GANs). Training GANs as a two-player game. Convolutional architectures for GANs. Conditional GANs, supervised/unsupervised modes. pix2pix, CycleGAN.
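For illustration (the two discrete distributions below are made up), forward and reverse Kullback-Leibler divergence computed directly, showing the asymmetry that distinguishes the two training objectives:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = [0.7, 0.2, 0.1]   # "data" distribution
q = [0.5, 0.3, 0.2]   # model distribution
print("forward KL(p||q):", kl(p, q))   # penalizes q being small where p has mass
print("reverse KL(q||p):", kl(q, p))   # penalizes q placing mass where p is small
```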

 

Lecture 8. Neural Networks for Control.

 

The AI approach to control; the difference between static and dynamic systems. Control of dynamic plants, examples of states. Case study: what if the inputs u are absent? Case study: what if the outputs y are absent? Modeling the forward/inverse dynamics of plants. Direct inverse neurocontrol. Model Reference Adaptive Neurocontrol. Backpropagation Through Time for neurocontrol (Danil Prokhorov’s models). Cascade training of differentiable models, with Generative Adversarial Nets as a side example. Approximate Dynamic Programming (ADP), Bellman’s principle of optimality. The straightforward solution: Model Predictive Control. ADP: policy and value iteration. Adaptive critics. ADP designs: Heuristic Dynamic Programming (HDP), Dual HDP (DHP), Globalized DHP (GDHP), and Action-Dependent (AD) designs.
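For illustration (a toy deterministic chain of states, not an example from the lecture), value iteration applying Bellman’s principle of optimality; approximate dynamic programming replaces the exact value table V with a learned critic network:

```python
import numpy as np

# Toy deterministic chain: states 0..4, actions {0: left, 1: right},
# reward +1 for reaching the absorbing goal state 4, discount gamma.
n_states, gamma = 5, 0.9

def step(s, a):
    """Transition model: returns (next_state, reward)."""
    if s == 4:                         # terminal (absorbing) state
        return s, 0.0
    s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
    return s_next, float(s_next == 4)

def q_value(s, a, V):
    """One-step lookahead: r + gamma * V(s')."""
    s_next, r = step(s, a)
    return r + gamma * V[s_next]

V = np.zeros(n_states)
for _ in range(100):                   # value iteration: V(s) <- max_a Q(s, a)
    V_new = np.array([max(q_value(s, a, V) for a in (0, 1)) for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new

policy = [int(np.argmax([q_value(s, a, V) for a in (0, 1)])) for s in range(n_states)]
print("V:", np.round(V, 3))            # values grow toward the goal state
print("greedy policy (0=left, 1=right):", policy)
```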

 

Lectures 9-12. Guest lectures from leading speakers from academia and industry.

 

Home assignments: two home assignments, 22 hours of personal work each.

Prerequisites