Machine learning is improving in a task T, w.r.t. performance measure P,
based on experience E.
We try to fit a linear line to some data set, making these assumptions:
- There is a relationship between the variables
- This relationship is linear
- This relationship will hold in the future
We have some inputs X = {x₁, x₂, ..., xn} and their corresponding labels T = {t₁, t₂, ... tn}. We want to predict the t for any x.
We can approximate the data using a linear equation:
t = mx + c
Where m and c are learnt variables.
We need to derive m and c (a.k.a w₁ and w₂) that minimises a loss
function.
We can vectorize this by converting m and c into a vector of weights W = {c, m} and the input becomes X = {1, x}. Therefore, we can do t = W.t ⋅ X.
This has the benefit of us being able to add more input attributes to X
without having to restructure our algorithms.
A function that estimates how poorly our model performs on the dataset.
We use here squared difference:
L = avg((t - t')²)
Where t is the ground truth, and t' is our prediction.
Try to maximise the likelihood of the prediction being correct
L = p(t|X, w, σ²) = mult(N(w.t⋅xₙ, σ²) for xₙ in X)
where
Xis the input datawis the learnt weightsσ²is the learnt standard deviation
Deriving these weights gives us the same results as least squares
We need to find dW/dL = 0, i.e. what values of W give use the minimum value
for our loss function
TODO: How do we know that this is the global minima? And not just a local minima/maxima?
We can derive from the loss function L the minimum value is given by:
W' = (X.t ⋅ X)⁻¹ * X.t ⋅ t
Where W' is our approximation for the perfect values of W.
We can do this by adding extra variables to our X input vector.
If we add x², x³, etc. to X then we can allow the model to make non-linear
predictions from a single input value x.
Therefore, X becomes [x⁰, x¹, x², ..., xn] or [1, x, x², ..., xn] for some
complexity n.
The more we increase n, the more our model can fit the data exactly. This has
two main effects:
- The model becomes very good at predicting data it has seen before
- The model becomes very bad at predicting data it hasn't seen before
This is called overfitting, and should be avoided. Therefore, we want to pick an
n that is large enough to model the complexity of the data, but not too big to
overfit it.
Regularising is a technique to stop a model from overfitting its training data.
One way to regularise is append another term to the cost function:
L' = L + λσ(W)
Where λ is the weighting of our regularisation, and σ is some function that
combines the values of W. For L1 regularisation, we have σ(W) = avg(abs(W)), and for L2 regularisation, we have σ(W) = abs(W²).
Intuition: Model complexity (thus overfitting) increases as W increases. If we
punish high values of W, then the model will only increase W (and therefore
complexity), if it is worth the gain in the original loss function.
- Add some random noise to make linear predictions look like real data
- Use additive noise:
tn = f(xn; W) + ε- where
p(ε) = N(0, σ²) - Model now determined by
Wandσ
- where
- For each
(t, x)we have a likelihood oftbeing observed, given by:p(t | x, W, σ²)- We now want to maximise this function
- TODO: Go over the maths for this part
avg(d(truth[n] == prediction[n]))
where
d(b: bool) = 1 if b else 0
Can be used to get classification accuracy
- Create a table of True Positive, True Negative, False Positive, False Negative
- Can be expanded for >2 classes
Sensitivity = TP / (TP + FN) = TP / (All Positive)
Specificity = TN / (TN + FP) = TN / (All Negative)
- Plot sensitivity and complementary specificity (
1 - Spec) on a graph - Modify the model in some way (e.g. change threshold)
- Check with modification has a point in the ROT graph closest to the top left corner (i.e. highest sensitivity and specificity)
- Get the area under the ROC curve
- Higher is better
- That way we get a performance metric that is independent of the variable we changed during ROC analysis (e.g. we might change threshold at run time)
Do one-against-all ROC analyses, so if you have 3 classes, you have 3 ROC graphs
- Take a percentage of the dataset and don't show it to the model (e.g. 80/20)
- Reduce training data
- Reduce representativeness, as you've lost some data
- Separate data into
Ksets - Train once with
K-1of the sets, validated on the remaining set - Repeat, cycling the validation set
Special case of k-fold where K = N
- No training, just store training examples and use those
- Lazy, does all work on inference
- Local approximation / generalisation
- Store all
Ntraining examples - Choose parameter
k < N - Find
knearest instances in training set - Find the most common class in these
knearest - In case of tie, assign randomly from tied
- Same as above, but average
knearest labels - Could use any distance measurement: Euclidean, Manhattan, cosine angle
- Cosine angle is measuring the difference in angles, so we essentially ignore the "magnitude" of the vectors
- Class imbalance, if you have a class with only 5 instances, and a
kof 11, you will never classify this class - Classification performance cost is high, even though training performance cost is non-existent
- There may be irrelevant attributes in the dataset
- Not too much data attributes (~20)
- Because kNN takes into account each variable with equal weighting, and more attributes implies more attributes that don't mean anything
- Lots of training data
- Can learn complex target functions (lots of wiggly class boundaries)
- Multiply the effect of each label by the distance to that instance
- Generalisation vs overfitting
- Error of a model consists of two components:
M = B² + VB²is mismatch between model and actual process that generated dataVis the variance, a more complex model has higher variance
- Non-prob models give hard classifications,
C=1, C=2... - While prob models give percentage likeliness,
C=10% 1, 90% 2- Provides a lot more information
- Important when the cost of misclassification is high
P(x|y) = P(y|x) * P(x) / P(y)
P(x)prior belief: probability ofxoccurring, without any other dataP(y|x)class-conditional likelihood: prob ofyoccurring givenxoccurredP(y)data evidence: marginal probability of the observed datay= Σ(P(y && x))for allx= Σ(P(y|x)P(x))for allx
P(x|y)posterior probability: probability ofxoccurring after seeing the new datay
- Pick the class that optimises the posterior probability
- For "maximum a posterior hypothesis":
argmax_c P(c|x)wherecis the class, andxis the observed data= argmax_c P(x|c) * P(c) / P(x)= argmax_c P(x|c) * P(c)becauseP(x)is independent ofc, and can be omitted
- For "maximum likelihood hypothesis":
- We omit
P(c), therefore assuming all classes are equally likely to occur - The equation becomes:
argmax_c P(x|c)
- We omit
- We're now classifying a list of data
X = {x₁, x₂...} - Maximum a posterior now becomes:
argmax_c P(x₁, x₂... | c) * P(c)
- The naive assumption is that all in
Xare conditionally independent, thus:P(X|c) = P(x₁|c) * P(x₂|c) * ...- Although, theoretically should be violated...
- This assumption simplifies process, and works well in practice
- Example:
- If we have attributes "has runny nose", "has temperature", and want to predict "has flu", the equation becomes:
P(flu|RN, TMP) = P(RN|flu) * P(TMP|flu) * P(flu)
- We calculate mean (
μ) and stddev (σ²) for each attribute inX - Then,
P(x|μ, σ²)is the Gaussian formula - Then, we can calculate the MAP and ML hypotheses
- We can do multi-variate Gaussian where
μis a vector andΣis a matrix, this is non-naive - We can also use naive bayes assumption here to deal with small amounts of training data
- Given some data
X = {x₁, x₂...}, can we group similar ones together? - Similarity
- We measure similarity as closeness, i.e.
(x₁ - x₂)²
- We measure similarity as closeness, i.e.
- We want to find a set
Kparameters - Each cluster is represented by it's center
- Say cluster
μcontainsX = {x₁, x₂..} - We represent it as the average of
X
- Say cluster
- Algorithm:
- Randomly initialise
KclustersC = {μ₁, μ₂...} - Assign each of
Xto its closest cluster inC - Update each cluster in
Cto be represented by the average of its new members - If the members of each
Cdo not change, finish
- Otherwise, go to step 2
- Randomly initialise
- Alternate algorithm:
- Randomly set all in
Xto a random cluster inC
Coriginally has no center untilXs are assigned to it
- Update each cluster in
Cto be the average of its assigned elements - Continue from step 2 in original algorithm
- Randomly set all in
- We can evaluate the performance of a run of K-means by calculating the
intra-group distance estimate:
D = avg(for c in C, for x in c: (x - c)²)
- K-means will find local-minima for
D- We can find a better
Dby re-running K-means several times, and picking the clustering with the bestD
- We can find a better
- How to choose
K?- When we increase
K,Ddecreases, but we do not get better results - Best to increase
Kuntil you stop seeing large improvements inD
- When we increase
- If we have non-linear groups, it is hard to cluster it with linear K-means comparison metrics
- So we can project it into some new space, by perhaps squaring the elements
φ(x) = x²k(x₁, x₂) = φ(x₁).t ⋅ φ(x₂)
- Some kernel functions:
- Linear kernel:
k(x₁, x₂) = x₁.t ⋅ x₂ - Gaussian kernel:
k(x₁, x₂) = exp{γ||x₁ - x₂||²} - Polynomial kernel:
k(x₁, x₂) = (x₁.t ⋅ x₂ + c)^β
- Linear kernel:
- For K-means, whenever we compare to elements we now use this kernel function
- When calculating distance to cluster, we average the cluster then use the kernel function
- We can use the kernel trick to estimate the distance between an element of
Xand a cluster inC
- Kernel trick:
D(x, μ) = (x - μ).t ⋅ (x - μ)
μ = avg(for x in μ)
D(x, μ) = (x - sum(for x in μ)).t ⋅ (x - sum(for x in μ))
D(x, μ) = (x.t ⋅ x)
... - 2*sum((x't ⋅ x) for x' in μ)
... + avg((x'.t ⋅ x'') for x' in μ, for x'' in μ)
And then replace dot product with kernel function:
D(x, μ) = k(x, x)
... - 2*sum(k(x, x) for x' in μ)
... + avg(k(x', x'') for x' in μ, for x'' in μ)
- Sensitive to initialisation
- Computationally expensive, having to calculate
n²distances each iteration - Need to run repeatedly
- Makes hard assumptions, each element is 100% belonging to 1 cluster
- Agglomerative (bottom-up) approach
- Each object starts as a singleton cluster
- The two most similar clusters are merged iteratively
- Stops when all objects are in the same cluster
- Divisive (top-down) approach
- Each object starts in same cluster
- Outsider objects from least cohesive cluster are removed iteratively
- Stops when all objects are in singleton clusters
- Measuring similarity between clusters:
- Smallest distance between elements
- Clusters can get very large
- Largest distance between links
- Small, round clusters
- Average distance between links
- Compromise between small and large
- Smallest distance between elements
Algorithm:
- Each object is a singleton cluster
- Compute distance between all pairs of clusters
- Merge the closest pair
- If there is more than one cluster, go to step 2
- Can tune the level of clustering easier
- More efficient than K-means
- Generative classifiers generate a model for each class
- e.g. Bayesian classifiers have some data for existing classes, and for new data try to see what class model fits best
- Discriminative classifiers attempt to define where these classes cross over
- TODO: These definitions seem a bit weak, are they correct?
- We want to find a linear hyperplane that separates the two classes positions
- Formalised as:
w.t ⋅ x + b = 0
- Formalised as:
- We can then find the best
wandbthat optimise this decision boundary- End up having to perform
argmin_w 1/2 w.t ⋅ w, with the constraint thattₙ(w.t ⋅ xₙ + b) ≥ 1 - See slides, I ain't copying that out
- End up having to perform
- Called SVMs due to the vectors closest to the decision boundary that "support" it
- Allow the some data points to lie closer to (or even over) the decision boundary
- This is done by relaxing the constraint to:
tₙ(w.t ⋅ xₙ + b) ≥ 1 - ξₙandξₙ ≥ 0
- The optimisation problem becomes:
argmin_w 1/2 w.t ⋅ w + C*sum(ξₙ)
Ccontrols the "softness" of the boundary
- Project into another space
x → φ(x)
- But we can just use the kernel tricks
x₁.t ⋅ x₂ → k(x₁, x₂)
- Gaussian kernel:
exp{-β||x₁ - x₂||²}βcontrols model complexity
- If we have
N'classes, haveN'different classifiers that do 1 vs rest classification - Each SVM returns a confidence score, we pick the one that gives the most confidence
- Can also train
N'(N' - 1) / 2different binary classifiers- Assign according to maximum voting
- Addresses class imbalance in the data
- Dimensionality is the number of features in the data
- Gradually add features, seeing how it effects performance
- Make new features by merging old features (linearly/non-linearly)
X = Y⋅WwhereYis(N x M), the original data with high dimensionalityMWis(M x D), the matrix that reduces dimensionality toDXis(N x D), the data with reduced dimensionalityD
- We want to preserve the interesting aspects of the data
- Define the columns of
Wone by one- This means we define one of the output dimensions one by one
xₙ = Y⋅wₙto get thenth output dimension- Each
wₙis defined so that- The variance of
xₙis minimised - The new column
wₙis orthogonal to previous columns
- The variance of
Wis the collection of eigenvectors of the covarience matrixΣofY- Calculating
ΣΣₘₙ = avg((yₘ - μₘ)(yₙ - μₙ) for y in Y)- If we set
Y' = Y - avg(Y) Σ = (1/N)(Y'.t ⋅ Y')
- Calculating
Wwₙ.t ⋅ wₘ = 0whenn ≠ m- Chose
wₙto maximise variance:argmax_w var(Y'w)var(Y'w) = (Y'w).t ⋅ (Y'w)var(Y'w) = w.t ⋅ Y'.t ⋅ Y' ⋅ wvar(Y'w) = w.t ⋅ NΣ ⋅ wargmax_w w.t ⋅ Σ ⋅ w- Must have constraint
w.t ⋅ w = 1, otherwise can just keep increasingwto increase variance - Using Lagrange multiplier trick to add constraint in:
argmax_w (w.t⋅Σ⋅w - λ(w.t⋅w - 1))
- Differentiate w.r.t
wand set to 02Σw - 2λw = 0Σw = λw
- Solve using eigenvalue/eigenvector method (not in this module)
- Aside about eigenvectors
Σw = λwmeans that matmul byΣforwincreases by a scalar factorwis called an eigenvector, andλis the eigenvalue
- Solving
Σw = λwgives usMeigenvectors andMeigenvalues- The one with the highest
λhas the highest variance (we can see this by mutliplying both sides byw.t
- The one with the highest
- In summary: the principle components of
Yare the eigenvectors{w₁, w₂...}, ordered by their eigenvalues{λ₁, λ₂...}
- Combine weak classifiers to make a strong classifier
- Boosting
- Train a new classifier focusing on training examples misclassified b earlier model
- Bagging (bootstrap aggregation)
- Generate new training data as a random subset of original data
- Train new classifier on this subset
- For each split node
n, go right iffₙ(x) ≤ thₙ, and go left otherwise, wherexisNdimensional attributesfₙis of the formf(x) = w₀x₀ + w₁x₁ + ... + wₙxₙthₙis the threshold value
- For each leaf node, return a probability distribution across all classes
- Select a subset of the data
X - Create a set of random features
ff(x) = ax₁ + bx₂for 2D data, wherea, bare chosen randomly
- Try several combinations of
fandth- Separate
Xaccording tofandthintoXₗ, Xᵣ - Pick the
fandththat optimise the information gain of this split IG(Xₗ, Xᵣ) = E(Xₗ + Xᵣ) - (|Xₗ|/|X|)E(Xₗ) - (|Xᵣ|/|X|)E(Xᵣ)- where
E = -sum(P(c)log₂(P(c)) for c in classes)
- where
- Separate
- Recurse for left and right children
- Generate
nbinary decision trees- Each one uses a random subset of the training data
- Subsets can overlap
- Increases runtime
- Each one uses a random subset of the training data
- When evaluating, run all and average the response
- We can draw approximate point clouds based off the mean and covariance matrices of a point cloud
- We will assume two dimensions
- Uniform variance and no covariance:
- e.g.
[[a, 0], [0, a]] - Circle point cloud, where
ais proportional to its width
- e.g.
- Non-uniform variance and no covariance
- e.g.
[[a, 0], [0, b]] - Ellipse point cloud, where
ais proportional to its width onx-axis, andbis proportional to its heigh ony-axis - Ellipse is always axis-aligned
- e.g.
- Uniform variance and covariance
- e.g.
[[a, c], [c, a]] - Ellipse point cloud, where
ais proportional to its width on both axes, andcis how "thin" the ellipse is - Aligned with
x = yline - Unless
cis negative, then aligned withx = -y
- e.g.
- Non-uniform variance and covariance
- e.g.
[[a, c], [c, b]] - Ellipse point cloud, where
ais proportional to its width onx-axis, andbis proportional to its heigh ony-axis, andcis how "thin" the ellipse is - Assuming
bis positive, is more aligned tox-axis ifa > band vice versa
- e.g.
- Without naive bayes assumption
- Identical variance, no covariance
- Linear decision boundary
- Identical variance, identical covariance
- Linear decision boundary
- Non-identical variance, non-identical covariance
- Non-linear decision boundary
- Identical variance, no covariance
- With naive bayes assumption
- Identical variance
- Linear decision boundary
- Non-identical variance
- Non-linear decision boundary
- Identical variance
- Linear regression
- Bayesian classification (maximum a posterior)
- Bayesian classification (maximum likelihood)
- Better when classes are equally distributed
- Bayesian classification (naive bayes)
- Better for low amount of data
- K nearest neighbours
- Worse for high amount of attributes
- Because more likely to provide bad attributes
- Worse for high amount of attributes
- Support vector machines
- Better for low amount of data
- Better for dealing with outliers
- Randomised decision forests
- K-means clustering
- Hierarchical agglomerative clustering