Machine Learning Questions

Classical Machine Learning
Dimensionality Reduction
Machine Learning Workflows
- Basics
- Sampling and Creating Training Data
Objective Functions and Performance Metrics

Classical Machine Learning

What are the basic assumptions to be made for linear regression?
What happens if we don’t apply feature scaling to logistic regression?
What are the algorithms you’d use when developing the prototype of a fraud detection model?
Feature selection.
1. Why do we use feature selection?
2. What are some of the algorithms for feature selection? Pros and cons of each.
k-means clustering.
1. How would you choose the value of k?
2. If the labels are known, how would you evaluate the performance of your k-means clustering algorithm?
3. How would you do it if the labels aren’t known?
4. Given the following dataset, can you predict how K-means clustering works on it? Explain.
k-nearest neighbor classification.
1. How would you choose the value of k?
2. What happens when you increase or decrease the value of k?
3. How does the value of k impact the bias and variance?
k-means and GMM are both powerful clustering algorithms.
1. Compare the two.
2. When would you choose one over another?
Bagging and boosting are two popular ensembling methods. Random forest is a bagging example while XGBoost is a boosting example.
1. What are some of the fundamental differences between bagging and boosting algorithms?
2. How are they used in deep learning?
Given this directed graph.
1. Construct its adjacency matrix.
2. How would this matrix change if the graph is now undirected?
3. What can you say about the adjacency matrices of two isomorphic graphs?
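The graph this question refers to is not reproduced here, so the following is only a minimal sketch on a hypothetical 4-node directed graph (the edge list `edges` and the helper names are assumptions, not the graph from the question). It shows one way to build an adjacency matrix and how treating the same edges as undirected makes the matrix symmetric.

```python
# Hypothetical directed graph: (source, target) pairs over nodes 0..3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4


def adjacency_matrix(edges, n, directed=True):
    """Return an n x n 0/1 matrix with A[i][j] = 1 for each edge i -> j."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1  # an undirected edge is recorded in both directions
    return A


def transpose(A):
    """Return the transpose of a square list-of-lists matrix."""
    return [list(row) for row in zip(*A)]


directed_A = adjacency_matrix(edges, n, directed=True)
undirected_A = adjacency_matrix(edges, n, directed=False)

for row in directed_A:
    print(row)
print()
for row in undirected_A:
    print(row)
```

In this sketch the undirected matrix equals its own transpose, while the directed one generally does not.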
Imagine we build a user-item collaborative filtering system to recommend to each user items similar to the items they’ve bought before.
1. You can build either a user-item matrix or an item-item matrix. What are the pros and cons of each approach?
2. How would you handle a new user who hasn’t made any purchases in the past?
Is feature scaling necessary for kernel methods?
Naive Bayes classifier.
1. How is Naive Bayes classifier naive?
2. Let’s try to construct a Naive Bayes classifier to classify whether a tweet has a positive or negative sentiment. We have four training samples:
$$\begin{bmatrix} \text{Tweet} & \text{Label} \\ \text{This makes me so upset} & \text{Negative} \\ \text{This puppy makes me happy} & \text{Positive} \\ \text{Look at this happy hamster} & \text{Positive} \\ \text{No hamsters allowed in my house} & \text{Negative} \end{bmatrix}$$
According to your classifier, what's the sentiment of the sentence "The hamster is upset with the puppy"?
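One way to sanity-check an answer is a small multinomial Naive Bayes sketch over the four tweets. Several choices here are assumptions rather than part of the question: lowercase whitespace tokenization, no stemming (so "hamster" and "hamsters" are different tokens), add-one (Laplace) smoothing, and ignoring words that never appear in training. Different choices can change the scores.

```python
import math
from collections import Counter

train = [
    ("This makes me so upset", "Negative"),
    ("This puppy makes me happy", "Positive"),
    ("Look at this happy hamster", "Positive"),
    ("No hamsters allowed in my house", "Negative"),
]


def tokenize(text):
    # Assumption: lowercase + whitespace split, no stemming or stop-word removal.
    return text.lower().split()


word_counts = {}        # label -> Counter of token frequencies
doc_counts = Counter()  # label -> number of training tweets
for text, label in train:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(tokenize(text))

vocab = {w for counts in word_counts.values() for w in counts}


def log_scores(text):
    """Unnormalized log-posterior per label, using add-one smoothing."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / len(train))  # log prior
        for w in tokenize(text):
            if w in vocab:  # assumption: skip words unseen in training
                score += math.log((counts[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores


def predict(text):
    scores = log_scores(text)
    return max(scores, key=scores.get)


print(predict("The hamster is upset with the puppy"))
```

Under these assumptions only "hamster", "upset", and "puppy" contribute, and the sketch labels the sentence Positive. That mostly shows how a bag-of-words model ignores who is upset with whom.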
Two popular algorithms for winning Kaggle solutions are Light GBM and XGBoost. They are both gradient boosting algorithms.
1. What is gradient boosting?
2. What problems is gradient boosting good for?
SVM.
1. What’s linear separation? Why is it desirable when we use SVM?
2. How well would vanilla SVM work on this dataset?
3. How well would vanilla SVM work on this dataset?
4. How well would vanilla SVM work on this dataset?

Dimensionality Reduction

Why do we need dimensionality reduction?
Eigendecomposition is a common factorization technique used for dimensionality reduction. Is the eigendecomposition of a matrix always unique?
Name some applications of eigenvalues and eigenvectors.
We want to do PCA on a dataset of multiple features in different ranges. For example, one is in the range $0-1$ and one is in the range $10 - 1000$. Will PCA work on this dataset?
Under what conditions can one apply eigendecomposition? What about SVD?
1. What is the relationship between SVD and eigendecomposition?
2. What’s the relationship between PCA and SVD?
How does t-SNE (t-distributed Stochastic Neighbor Embedding) work? Why do we need it?

Machine Learning Workflows

Basics

Explain supervised, unsupervised, weakly supervised, semi-supervised, and active learning.
Empirical risk minimization.
1. What’s the risk in empirical risk minimization?
2. Why is it empirical?
3. How do we minimize that risk?
Occam's razor states that when the simple explanation and complex explanation both work equally well, the simple explanation is usually correct. How do we apply this principle in ML?
What are the conditions that allowed deep learning to gain popularity in the last decade?
If we have a wide NN and a deep NN with the same number of parameters, which one is more expressive and why?
The Universal Approximation Theorem states that a neural network with 1 hidden layer can approximate any continuous function for inputs within a specific range. Then why can’t a simple neural network reach an arbitrarily small positive error?
What are saddle points and local minima? Which are thought to cause more problems for training large NNs?
Hyperparameters.
1. What are the differences between parameters and hyperparameters?
2. Why is hyperparameter tuning important?
3. Explain an algorithm for tuning hyperparameters.
Classification vs. regression.
1. What makes a classification problem different from a regression problem?
2. Can a classification problem be turned into a regression problem and vice versa?
Parametric vs. non-parametric methods.
1. What’s the difference between parametric methods and non-parametric methods? Give an example of each method.
2. When should we use one and when should we use the other?
Why does ensembling independently trained models generally improve performance?
Why does L1 regularization tend to lead to sparsity while L2 regularization pushes weights closer to 0?
Why does an ML model’s performance degrade in production?
What problems might we run into when deploying large machine learning models?
Your model performs really well on the test set but poorly in production.
1. What are your hypotheses about the causes?
2. How do you validate whether your hypotheses are correct?
3. Imagine your hypotheses about the causes are correct. What would you do to address them?

Sampling and Creating Training Data

If you have 6 shirts and 4 pairs of pants, how many ways are there to choose 2 shirts and 1 pair of pants?
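A quick way to check the count, assuming the order of selection does not matter and the shirt and pants choices are independent, is to multiply the two binomial coefficients. The variable names below are illustrative.

```python
from math import comb  # available in Python 3.8+

shirts, pants = 6, 4

# Choose 2 of 6 shirts and 1 of 4 pairs of pants; the choices are independent,
# so the counts multiply.
ways = comb(shirts, 2) * comb(pants, 1)
print(ways)  # 15 * 4 = 60
```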
What is the difference between sampling with and without replacement? Give an example of when you would use one rather than the other.
Explain Markov chain Monte Carlo sampling.
If you need to sample from high-dimensional data, which sampling method would you choose?
Suppose we have a classification task with many classes. An example is when you have to predict the next word in a sentence -- the next word can be one of many, many possible words. If we have to calculate the probabilities for all classes, it’ll be prohibitively expensive. Instead, we can calculate the probabilities for a small set of candidate classes. This method is called candidate sampling. Name and explain some of the candidate sampling algorithms.
Suppose you want to build a model to classify whether a Reddit comment violates the website’s rules. You have $10$ million unlabeled comments from $10K$ users over the last $24$ months and you want to label $100K$ of them.
1. How would you sample $100K$ comments to label?
2. Suppose you get back $100K$ labeled comments from $20$ annotators and you want to look at some labels to estimate the quality of the labels. How many labels would you look at? How would you sample them?
Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker argues that we should translate more articles into Chinese because translations help with readership. On average, your translated articles have twice as many views as your non-translated articles. What might be wrong with this argument?
How do you determine whether two sets of samples (e.g. train and test splits) come from the same distribution?
How do you know you’ve collected enough samples to train your ML model?
How do you detect outliers in your data samples? What do you do with them?
Sample duplication
1. When should you remove duplicate training samples? When shouldn’t you?
2. What happens if we accidentally duplicate every data point in your train set or in your test set?
Missing data
1. In your dataset, two out of 20 variables have more than 30% missing values. What would you do?
2. How might techniques that handle missing data make selection bias worse? How do you handle this bias?
Why is randomization important when designing experiments (experimental design)?
Class imbalance.
1. How would class imbalance affect your model?
2. Why is it hard for ML models to perform well on data with class imbalance?
3. Imagine you want to build a model to detect skin lesions from images. In your training dataset, only 1% of your images show signs of lesions. After training, your model seems to make a lot more false negatives than false positives. What are some of the techniques you'd use to improve your model?
Training data leakage.
1. Imagine you're working with a binary task where the positive class accounts for only 1% of your data. You decide to oversample the rare class then split your data into train and test splits. Your model performs well on the test split but poorly in production. What might have happened?
2. You want to build a model to classify whether a comment is spam or not spam. You have a dataset of a million comments over the period of 7 days. You decide to randomly split all your data into the train and test splits. Your co-worker points out that this can lead to data leakage. How?
How does data sparsity affect your models?
Feature leakage
1. What are some causes of feature leakage?
2. Why does normalization help prevent feature leakage?
3. How do you detect feature leakage?
Suppose you want to build a model to classify whether a tweet spreads misinformation. You have 100K labeled tweets over the last 24 months. You decide to randomly shuffle your data and pick 80% to be the train split, 10% to be the valid split, and 10% to be the test split. What might be the problem with this way of partitioning?
You’re building a neural network and you want to use both numerical and textual features. How would you process those different features?
Your model has been performing fairly well using just a subset of features available in your data. Your boss decided that you should use all the features available instead. What might happen to the training error? What might happen to the test error?

Objective Functions and Performance Metrics

Convergence.
1. When we say an algorithm converges, what does convergence mean?
2. How do we know when a model has converged?
Draw the loss curves for overfitting and underfitting.
Bias-variance trade-off
1. What’s the bias-variance trade-off?
2. How’s this trade-off related to overfitting and underfitting?
3. How do you know that your model is high variance, low bias? What would you do in this case?
4. How do you know that your model is low variance, high bias? What would you do in this case?
Cross-validation.
1. Explain different methods for cross-validation.
2. Why don’t we see more cross-validation in deep learning?
Train, valid, test splits.
1. What’s wrong with training and testing a model on the same data?
2. Why do we need a validation set on top of a train set and a test set?
3. Your model’s loss curves on the train, valid, and test sets look like this. What might have been the cause of this? What would you do?
Your team is building a system to aid doctors in predicting whether a patient has cancer or not from their X-ray scan. Your colleague announces that the problem is solved now that they’ve built a system that can predict with 99.99% accuracy. How would you respond to that claim?
F1 score.
1. What’s the benefit of F1 over the accuracy?
2. Can we still use F1 for a problem with more than two classes? How?
Given a binary classifier that outputs the following confusion matrix.

$$\begin{bmatrix} & \text{Predicted True} & \text{Predicted False} \\ \text{Actual True} & 30 & 20 \\ \text{Actual False} & 5 & 40 \end{bmatrix}$$

1. Calculate the model’s precision, recall, and F1.
2. What can we do to improve the model’s performance?
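The arithmetic for the confusion matrix above can be checked with a few lines, assuming "True" is the positive class (so the top-left cell is the true positives). The variable names are illustrative.

```python
# Cells of the confusion matrix, with "True" taken as the positive class.
tp, fn = 30, 20  # actual True:  predicted True / predicted False
fp, tn = 5, 40   # actual False: predicted True / predicted False

precision = tp / (tp + fp)  # of everything predicted True, how much was right
recall = tp / (tp + fn)     # of everything actually True, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.857 recall=0.600 f1=0.706
```

Recall is the weaker of the two numbers here, which is one place to start when thinking about improvements.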

Consider a classification problem where 99% of the data belongs to class A and 1% of the data belongs to class B.
1. If your model predicts A 100% of the time, what would the F1 score be? Hint: The F1 score when A is mapped to 0 and B to 1 is different from the F1 score when A is mapped to 1 and B to 0.
2. If we have a model that predicts A and B at random (uniformly), what would the expected $F_1$ be?
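As a rough check on both parts, the sketch below computes F1 from expected fractions of the data rather than from a finite sample. This approximates the expected F1 on a large dataset and is an assumption of the sketch, as is the convention that F1 is 0 when there are no true positives. The helper `f1` and the variable names are illustrative.

```python
def f1(tp, fp, fn):
    """F1 from (possibly fractional) counts; returns 0 if the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


p_a, p_b = 0.99, 0.01  # class frequencies from the question

# Model 1: always predicts A.
always_a_pos_b = f1(tp=0.0, fp=0.0, fn=p_b)  # B treated as the positive class
always_a_pos_a = f1(tp=p_a, fp=p_b, fn=0.0)  # A treated as the positive class

# Model 2: predicts A or B with probability 0.5 each, independent of the input.
random_pos_b = f1(tp=0.5 * p_b, fp=0.5 * p_a, fn=0.5 * p_b)
random_pos_a = f1(tp=0.5 * p_a, fp=0.5 * p_b, fn=0.5 * p_a)

print(always_a_pos_b, round(always_a_pos_a, 4))
print(round(random_pos_b, 4), round(random_pos_a, 4))
```

The gap between the two mappings illustrates the hint: F1 depends heavily on which class is treated as positive.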
For logistic regression, why is log loss recommended over MSE (mean squared error)?
When should we use RMSE (Root Mean Squared Error) over MAE (Mean Absolute Error) and vice versa?
Show that the negative log-likelihood and cross-entropy are the same for binary classification tasks.
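One possible outline of the argument, under notation that is assumed here rather than given in the question: labels $y_i \in \{0, 1\}$, predicted probabilities $p_i = P(y = 1 \mid x_i)$, and $N$ independent samples. The true distribution for sample $i$ is $q_i = (y_i, 1 - y_i)$ and the predicted one is $\hat{q}_i = (p_i, 1 - p_i)$.

$$\begin{aligned}
\text{NLL} &= -\log \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i} = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\
H(q_i, \hat{q}_i) &= -\sum_{c \in \{1, 0\}} q_i(c) \log \hat{q}_i(c) = -\left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\
\sum_{i=1}^{N} H(q_i, \hat{q}_i) &= \text{NLL}
\end{aligned}$$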
For classification tasks with more than two labels (e.g. MNIST with $10$ labels), why is cross-entropy a better loss function than MSE?
Consider a language with an alphabet of $27$ characters. What would be the maximal entropy of this language?
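A short numeric check, assuming entropy is measured per character in bits and that the maximum is attained when all 27 characters are equally likely and independent of each other:

```python
import math

alphabet_size = 27

# Entropy is maximized by the uniform distribution over the alphabet.
p = [1 / alphabet_size] * alphabet_size
entropy_bits = -sum(pi * math.log2(pi) for pi in p)

print(round(entropy_bits, 4))  # log2(27) ≈ 4.7549 bits per character
```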
A lot of machine learning models aim to approximate probability distributions. Let’s say P is the distribution of the data and Q is the distribution learned by our model. How do we measure how close Q is to P?
MPE (Most Probable Explanation) vs. MAP (Maximum A Posteriori)
1. How do MPE and MAP differ?
2. Give an example of when they would produce different results.
Suppose you want to build a model to predict the price of a stock in the next 8 hours and that the predicted price should never be off by more than 10% from the actual price. Which metric would you use?
