-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathch10.Rmd
More file actions
68 lines (58 loc) · 2 KB
/
ch10.Rmd
File metadata and controls
68 lines (58 loc) · 2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
Principal Components
====================
We will use the `USArrests` data (which is in R)
```{r}
dimnames(USArrests)
apply(USArrests,2,mean)
apply(USArrests,2, var)
```
We see that `Assault` has a much larger variance than the other variables. It would dominate the principal components, so we choose to standardize the variables when we perform PCA
```{r}
pca.out=prcomp(USArrests, scale=TRUE)
pca.out
names(pca.out)
biplot(pca.out, scale=0)
```
K-Means Clustering
==================
K-means works in any dimension, but is most fun to demonstrate in two, because we can plot pictures.
Lets make some data with clusters. We do this by shifting the means of the points around.
```{r}
set.seed(101)
x=matrix(rnorm(100*2),100,2)
xmean=matrix(rnorm(8,sd=4),4,2)
which=sample(1:4,100,replace=TRUE)
x=x+xmean[which,]
plot(x,col=which,pch=19)
```
We know the "true" cluster IDs, but we wont tell that to the `kmeans` algorithm.
```{r}
km.out=kmeans(x,4,nstart=15)
km.out
plot(x,col=km.out$cluster,cex=2,pch=1,lwd=2)
points(x,col=which,pch=19)
points(x,col=c(4,3,2,1)[which],pch=19)
```
Hierarchical Clustering
=======================
We will use these same data and use hierarchical clustering
```{r}
hc.complete=hclust(dist(x),method="complete")
plot(hc.complete)
hc.single=hclust(dist(x),method="single")
plot(hc.single)
hc.average=hclust(dist(x),method="average")
plot(hc.average)
```
Lets compare this with the actualy clusters in the data. We will use the function `cutree` to cut the tree at level 4.
This will produce a vector of numbers from 1 to 4, saying which branch each observation is on. You will sometimes see pretty plots where the leaves of the dendrogram are colored. I searched a bit on the web for how to do this, and its a little too complicated for this demonstration.
We can use `table` to see how well they match:
```{r}
hc.cut=cutree(hc.complete,4)
table(hc.cut,which)
table(hc.cut,km.out$cluster)
```
or we can use our group membership as labels for the leaves of the dendrogram:
```{r}
plot(hc.complete,labels=which)
```