- Training a Neural Network
- Natural Language Processing
- Computer Vision
- Reinforcement Learning
- Advanced Topics
- When building a neural network, should you overfit or underfit it first?
- Write the vanilla gradient update.
- Neural network in simple NumPy.
- Write in plain NumPy the forward and backward pass for a two-layer feed-forward neural network with a ReLU layer in between.
- Implement vanilla dropout for the forward and backward pass in NumPy.
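For the NumPy questions above (vanilla gradient update, two-layer forward and backward pass, vanilla dropout), one possible sketch is shown below. It is not a reference answer. The mean squared error loss, the function names, and the parameter layout `[W1, b1, W2, b2]` are assumptions made for illustration; the questions do not specify them. "Vanilla" dropout is taken here to mean the non-inverted variant, which masks at training time and rescales at test time.

```python
import numpy as np

def forward(x, params):
    """Forward pass: linear -> ReLU -> linear."""
    W1, b1, W2, b2 = params
    z1 = x @ W1 + b1           # (N, H) pre-activation
    h = np.maximum(z1, 0.0)    # ReLU
    y_hat = h @ W2 + b2        # (N, D_out)
    return y_hat, (x, z1, h)

def mse_loss(y_hat, y):
    """Assumed loss: half squared error, averaged over the batch."""
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(y_hat, y, cache, params):
    """Backward pass: gradients of mse_loss w.r.t. [W1, b1, W2, b2]."""
    W1, b1, W2, b2 = params
    x, z1, h = cache
    n = x.shape[0]
    dy = (y_hat - y) / n       # dL/dy_hat
    dW2 = h.T @ dy
    db2 = dy.sum(axis=0)
    dh = dy @ W2.T
    dz1 = dh * (z1 > 0)        # ReLU subgradient: 0 where z1 <= 0
    dW1 = x.T @ dz1
    db1 = dz1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def sgd_step(params, grads, lr=0.1):
    """Vanilla gradient update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

def dropout_forward(h, p_drop, rng, train=True):
    """Non-inverted dropout: mask when training, scale by keep probability at test time."""
    if not train:
        return h * (1.0 - p_drop), None
    mask = rng.random(h.shape) >= p_drop   # True = keep the unit
    return h * mask, mask

def dropout_backward(dout, mask):
    """Gradient flows only through the units that were kept."""
    return dout * mask
```

A typical loop would call `forward`, then `backward`, then `sgd_step`. Comparing one entry of `backward`'s output against a finite-difference estimate of `mse_loss` is a quick way to check the derivation.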
- Activation functions.
- Draw the graphs for sigmoid, tanh, ReLU, and leaky ReLU.
- Pros and cons of each activation function.
- Is ReLU differentiable? What to do when it’s not differentiable?
- Derive the derivative of the sigmoid function when its input x is a vector.
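As a hedged starting point for the sigmoid question above, the block below assumes the sigmoid is applied elementwise to a vector $x$ (the question does not say so explicitly). Under that assumption each output depends only on its own input, so the Jacobian is diagonal.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^{2}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)

\frac{\partial\, \sigma(x)_i}{\partial x_j} =
\begin{cases}
\sigma(x_i)\,\bigl(1 - \sigma(x_i)\bigr) & \text{if } i = j \\
0 & \text{if } i \neq j
\end{cases}
\qquad\Longrightarrow\qquad
J = \operatorname{diag}\bigl(\sigma(x) \odot (1 - \sigma(x))\bigr)
```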
- What’s the motivation for skip connections in neural networks?
- Vanishing and exploding gradients.
- How do we know that gradients are exploding? How do we prevent it?
- Why are RNNs especially susceptible to vanishing and exploding gradients?
- Weight normalization separates a weight vector’s norm from its direction. How would it help with training?
- When training a large neural network, say a language model with a billion parameters, you evaluate your model on a validation set at the end of every epoch. You realize that your validation loss is often lower than your train loss. What might be happening?
- What criteria would you use for early stopping?
- Gradient descent vs SGD vs mini-batch SGD.
- It’s a common practice to train deep learning models using epochs: we sample batches from data without replacement. Why would we use epochs instead of just sampling data with replacement?
- Your model’s weights fluctuate a lot during training. How does that affect your model’s performance? What to do about it?
- Learning rate.
- Draw a graph of the number of training epochs vs. training error when the learning rate is:
- too high
- too low
- acceptable.
- What’s learning rate warmup? Why do we need it?
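To make the warmup question above concrete, here is a minimal sketch of one common variant, linear warmup followed by a constant rate. The function name `warmup_lr` and its default values are hypothetical, and real schedules often decay the rate after warmup instead of holding it.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup: ramp the learning rate up to base_lr, then hold it constant.

    `step` is the zero-based index of the optimizer update.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With `base_lr=1.0` and `warmup_steps=10`, the rate grows by 0.1 per step until it reaches 1.0 at step 9 and stays there.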
- Compare batch norm and layer norm.
- Why is squared L2 norm sometimes preferred to L2 norm for regularizing neural networks?
- Some models use weight decay: after each gradient update, the weights are multiplied by a factor slightly less than 1. What is this useful for?
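The weight decay rule described above can be sketched as follows: take a plain gradient step, then multiply the weights by a factor slightly below 1. The function name and the values of `lr` and `decay_factor` are assumptions chosen for illustration.

```python
def step_with_weight_decay(weights, grads, lr=0.1, decay_factor=0.99):
    """One update: vanilla gradient step, then shrink every weight by decay_factor."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return [decay_factor * w for w in updated]
```

A weight that receives zero gradient is still multiplied by `decay_factor` on every step, so it shrinks geometrically toward zero. This is what pushes the model toward smaller weights.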
- It’s a common practice for the learning rate to be reduced throughout the training.
- What’s the motivation?
- What might be the exceptions?
- Batch size.
- What happens to your model training when you decrease the batch size to 1?
- What happens when you use the entire training data in a batch?
- How should we adjust the learning rate as we increase or decrease the batch size?
- Why is Adagrad sometimes favored in problems with sparse gradients?
- Adam vs. SGD.
- What can you say about the convergence and generalization ability of Adam vs. SGD?
- What else can you say about the difference between these two optimizers?
- With model parallelism, you might update your model weights using the gradients from each machine asynchronously or synchronously. What are the pros and cons of asynchronous SGD vs. synchronous SGD?
- Why shouldn’t we have two consecutive linear layers in a neural network?
- Can a neural network with only ReLU (non-linearity) act as a linear classifier?
- Design the smallest neural network that can function as an XOR gate.
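For the XOR question above, one small candidate is a network with two ReLU hidden units and a linear output. The weights below are hand-picked for illustration. Whether this counts as "the smallest" depends on which activations and connections are allowed, so treat it as a sketch and not the definitive answer.

```python
def relu(z):
    return max(z, 0.0)

def xor_net(x1, x2):
    """Two hidden ReLU units, one linear output unit, hand-set weights."""
    h1 = relu(x1 + x2)          # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2        # cancels the (1, 1) case
```

On the four binary inputs (0, 0), (0, 1), (1, 0), (1, 1) this returns 0, 1, 1, 0.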
- Why don’t we just initialize all weights in a neural network to zero?
- Stochasticity.
- What are some sources of randomness in a neural network?
- Sometimes stochasticity is desirable when training neural networks. Why is that?
- Dead neuron.
- What’s a dead neuron?
- How do we detect them in our neural network?
- How to prevent them?
- Pruning.
- Pruning is a popular technique where certain weights of a neural network are set to 0. Why is it desirable?
- How do you choose what to prune from a neural network?
- Under what conditions would it be possible to recover training data from the weight checkpoints?
- Why do we try to reduce the size of a big trained model through techniques such as knowledge distillation instead of just training a small model from the beginning?
- RNNs
- What’s the motivation for RNN?
- What’s the motivation for LSTM?
- How would you apply dropout in an RNN?
- What’s density estimation? Why do we say a language model is a density estimator?
- Language models are often referred to as unsupervised learning, but some say its mechanism isn’t that different from supervised learning. What are your thoughts?
- Word embeddings.
- Why do we need word embeddings?
- What’s the difference between count-based and prediction-based word embeddings?
- Most word embedding algorithms are based on the assumption that words that appear in similar contexts have similar meanings. What are some of the problems with context-based word embeddings?
- Given 5 documents:
  - D1: The duck loves to eat the worm
  - D2: The worm doesn’t like the early bird
  - D3: The bird loves to get up early to get the worm
  - D4: The bird gets the worm from the early duck
  - D5: The duck and the birds are so different from each other but one thing they have in common is that they both get the worm
  1. Given a query Q: “The early bird gets the worm”, find the two top-ranked documents according to the TF/IDF rank using the cosine similarity measure and the term set {bird, duck, worm, early, get, love}. Are the top-ranked documents relevant to the query?
  2. Assume that document D5 goes on to tell more about the duck and the bird and mentions “bird” three times, instead of just once. What happens to the rank of D5? Is this change in the ranking of D5 a desirable property of TF/IDF? Why?
- Your client wants you to train a language model on their dataset but their dataset is very small with only about 10,000 tokens. Would you use an n-gram or a neural language model?
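For the TF/IDF ranking question over documents D1 to D5 above, the sketch below shows one way to set up the computation. Several choices are assumptions, because the question leaves them open: raw term counts as TF, `log(N / df)` as IDF, and a hand-written `NORMALIZE` table that maps "gets", "loves", and "birds" onto the given term set. Other TF/IDF variants can change the ordering of closely scored documents.

```python
import math
from collections import Counter

TERMS = ["bird", "duck", "worm", "early", "get", "love"]
# Assumed normalization: map inflected forms onto the term set by hand.
NORMALIZE = {"birds": "bird", "gets": "get", "loves": "love"}

DOCS = {
    "D1": "The duck loves to eat the worm",
    "D2": "The worm doesn't like the early bird",
    "D3": "The bird loves to get up early to get the worm",
    "D4": "The bird gets the worm from the early duck",
    "D5": "The duck and the birds are so different from each other but one thing "
          "they have in common is that they both get the worm",
}
QUERY = "The early bird gets the worm"

def term_counts(text):
    """Count occurrences of the terms in TERMS, after lowercasing and normalizing."""
    tokens = [NORMALIZE.get(t, t) for t in text.lower().split()]
    return Counter(t for t in tokens if t in TERMS)

def tfidf_vector(counts, idf):
    return [counts[t] * idf[t] for t in TERMS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(docs, query):
    """Return (document name, cosine score) pairs, best match first."""
    counts = {name: term_counts(text) for name, text in docs.items()}
    n = len(docs)
    df = {t: sum(1 for c in counts.values() if c[t] > 0) for t in TERMS}
    idf = {t: math.log(n / df[t]) if df[t] else 0.0 for t in TERMS}
    q = tfidf_vector(term_counts(query), idf)
    scores = {name: cosine(tfidf_vector(c, idf), q) for name, c in counts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank(DOCS, QUERY)[:2]` gives the two top-ranked documents under these assumptions. For the second part of the question, the text of D5 can be edited to mention "bird" three times and the ranking recomputed. Note that with this IDF, a term that appears in every document (here "worm") gets weight zero.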
- For n-gram language models, does increasing the context length (n) improve the model’s performance? Why or why not?
- What problems might we encounter when using softmax as the last layer for word-level language models? How do we fix it?
- What’s the Levenshtein distance of the two words “doctor” and “bottle”?
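The Levenshtein question above can be checked with the standard dynamic-programming recurrence. The sketch below assumes unit cost for insertion, deletion, and substitution, which is the usual definition; the function name is illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3, and `levenshtein("doctor", "bottle")` answers the question as posed.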
- BLEU is a popular metric for machine translation. What are the pros and cons of BLEU?
- On the same test set, LM model A has a character-level entropy of 2 while LM model B has a word-level entropy of 6. Which model would you choose to deploy?
- Imagine you have to train a NER model on the text corpus A. Would you make A case-sensitive or case-insensitive?
- Why does removing stop words sometimes hurt a sentiment analysis model?
- Many models use relative position embedding instead of absolute position embedding. Why is that?
- Some NLP models use the same weights for both the embedding layer and the layer just before softmax. What’s the purpose of this?
- For neural networks that work with images, such as VGG-19 and InceptionNet, you often see a visualization of what type of features each filter captures. How are these visualizations created?
- Filter size.
- How are your model’s accuracy and computational efficiency affected when you decrease or increase its filter size?
- How do you choose the ideal filter size?
- Convolutional layers are also known as “locally connected.” Explain what it means.
- When we use CNNs for text data, what would the number of channels be for the first conv layer?
- What is the role of zero padding?
- Why do we need upsampling? How to do it?
- What does a 1x1 convolutional layer do?
- Pooling.
- What happens when you use max-pooling instead of average pooling?
- When should we use one instead of the other?
- What happens when pooling is removed completely?
- What happens if we replace a 2 x 2 max pool layer with a conv layer of stride 2?
- When we replace a normal convolutional layer with a depthwise separable convolutional layer, the number of parameters can go down. How does this happen? Give an example to illustrate this.
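As one illustration for the depthwise separable question above, the parameter counts can be compared directly. The sketch ignores bias terms and assumes a square `k x k` kernel with a depth multiplier of 1; the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Each of the c_out filters spans all c_in channels with a k x k kernel."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel, then a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For a 3 x 3 layer going from 64 to 128 channels, the standard convolution has 73,728 weights and the depthwise separable version has 576 + 8,192 = 8,768, roughly 8.4 times fewer.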
- Can you use a base model trained on ImageNet (image size 256 x 256) for an object classification task on images of size 320 x 360? How?
- How can a fully-connected layer be converted to a convolutional layer?
- Pros and cons of FFT-based convolution and Winograd-based convolution.
- Explain the explore vs exploit tradeoff with examples.
- How would a finite or infinite horizon affect our algorithms?
- Why do we need the discount term for objective functions?
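One part of the answer to the discount question above can be written down directly. Assuming rewards are bounded by $R_{\max}$ and the discount factor satisfies $0 \le \gamma < 1$ (notation introduced here, not taken from the question), the infinite-horizon return is a convergent geometric series:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad
|G_t| \;\le\; \sum_{k=0}^{\infty} \gamma^{k} R_{\max} \;=\; \frac{R_{\max}}{1 - \gamma}
```

Without discounting ($\gamma = 1$) the same sum can diverge when the horizon is infinite.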
- Fill in the empty circles using the minimax algorithm.
- Fill in the alpha and beta values as you traverse the minimax tree from left to right.
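The tree for the two minimax questions above is not reproduced here, so the sketch below runs minimax with alpha-beta pruning on a hypothetical tree encoded as nested lists, where a number is a leaf value and a list is an internal node. The encoding and the function name are assumptions made for illustration.

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax value of `node`, visiting children left to right with alpha-beta pruning."""
    if not isinstance(node, list):
        return node                      # leaf: return its value
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:            # the min player above will never allow this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:                # the max player above already has a better option
            break
    return value
```

For the hypothetical tree `[[3, 5], [2, 9]]` with a maximizing root, the left min node evaluates to 3, the right min node is cut off after seeing 2, and the root value is 3.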
- Given a policy, derive the reward function.
- Pros and cons of on-policy vs. off-policy.
- What’s the difference between model-based and model-free? Which one is more data-efficient?
- An autoencoder is a neural network that learns to copy its input to its output. When would this be useful?
- Self-attention.
- What’s the motivation for self-attention?
- Why would you choose a self-attention architecture over RNNs or CNNs?
- Why would you need multi-headed attention instead of just one head for attention?
- How would changing the number of heads in multi-headed attention affect the model’s performance?
- Transfer learning
- You want to build a classifier to predict sentiment in tweets but you have very little labeled data (say 1000). What do you do?
- What’s gradual unfreezing? How might it help with transfer learning?
- Bayesian methods.
- How do Bayesian methods differ from the mainstream deep learning approach?
- What are the pros and cons of Bayesian neural networks compared to mainstream neural networks?
- Why do we say that Bayesian neural networks are natural ensembles?
- GANs.
- What do GANs converge to?
- Why are GANs so hard to train?