SDS_HomeWork_3/Brain_2.Rmd at main · Statistics-For-Surfer/SDS_HomeWork_3 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
---
title: "Statistics for Data Science - Homework 3"
author: "Barba Paolo, Candi Matteo"

output:
  html_document:
     code_folding: hide
     theme:
      color-contrast-warnings: false
      bg: "#2B3E50"
      fg: "#B8BCC2"
      primary: "#EA80FC"
      secondary: "#B8BCC2"
      base_font:
      google: Prompt
      heading_font:
        google: Proza Libre

  editor_options:
     markdown:
        wrap: 75
---

```{=html}
<style type="text/css">
body, td {
   font-size: 14px;
}
code.r{
  font-size: 12px;
}
pre {
  font-size: 12px
}
title {
font-style: bold
}
</style>
```


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r Packages , warning = FALSE , message = FALSE , results='hide' , echo = FALSE}
rm(list = ls())
#load packages
packs <- c('MASS', 'pracma', 'psych','caTools','e1071','dgof','latex2exp', 'caret', 'glmnet','ggplot')
lapply(packs, require, character.only = TRUE)
```

#### Purpose and Statistical tools used

The goal of the project is to perform Friedman's procedure as a proxy of
multiple studies and apply it to fMRI (functional magnetic resonance
imaging) data in order to test whether the data from td (typically
development) subjects and asd (autism spectrum disorder) subjects came
from the same distribution or not. In order to reach this goal we used
statistical tools such as machine learning algorithms and two-sample
Hypothesis testing.

## **Exercise 3**

#### Two sample hypothesis test:

Friedman's procedure is used for solving two-sample hypothesis testing
when each observation consists of many measured attributes
$x_{i} = (x_{i_1}, x_{i_2} . . . x_{i_{M}})$ and
$z_{i} = (z_{i_1}, z_{i_2} . . . z_{i_{M}})$ where M is the dimensionality
of the dataset and we have $n_{0}$ samples for $\underline{x}$ and
$n_{1}$ samples for $\underline{z}$. We create then a hypothesis system
as the following:

```{=tex}
\begin{equation}

\begin{cases}

H_{0}: F_{x} = F_{z}\\ H_{1}: F_{x} \neq F_{z}

\end{cases}

\end{equation}
```

Since the distribution we want to compare is multivariate, there
are not a variety of procedures that can compare directly. The idea
then, is to perform classification machine learning algorithm as
logistic regression, compute the scores of the observed data
$\underline{s_{x}}$ and $\underline{s_{z}}$ and compare the univariate
distribution using a two-sample hypothesis test as the Kolmogorov-Smirnov test.
So the hypothesis system would be as the following:

```{=tex}
\begin{equation}

\begin{cases}

H_{0}: F_{s_{x}} = F_{s_{z}}\\ H_{1}: F_{s_{x}} \neq F_{s_{z}}

\end{cases}

\end{equation}
```
The take out the distribution under the null hypothesis we lead the
following procedure

<div>

<ol>

<li>Randomly permute the group label</li>

<li>Train classifier (Logistic regression in our case)</li>

<li>Compute the scores of the observations</li>

<li>Compute the test statistic (Kolmogorov-Smirnov in our case)</li>

<li>Repeat P times</li>

</ol>

</dic>

After this procedure, we obtain a set of statistics test
$\underline{t} = {t_{1} \dots t_{n}}$.

Under $H_{0}$, the entire data vector ${\underline{x} ,\underline{z}}$
is an $i.i.d$ sample from a single distribution F, the group identity is exchangeable. In the following, we work under the permutation null distribution, which places $\frac{1}{(n_{0} + n_{1})!}$ probability of each of the permutation of the group labels.

Then the observed statistics $\hat{t}$ is equally likely to be anywhere
in $\underline{t}$.

We can compute the p-value
$p = \frac{1}{N} \sum_{j=1}^N \mathbb{I}(t_j \geqslant \hat{t}) = \{\text{proportion of permuted statistics larger than the observed one}\}$
setting the decision rule in order to reject the null hypothesis if the
value of $p$ is lower than $\alpha$.

#### Goodness of fit test:

The two-sample hypothesis test can be turned into a goodness of fit test assuming that the sample $\underline{z}$ came from
a reference distribution (Say $F$). We can set up the following hypothesis system:

```{=tex}
\begin{equation}

\begin{cases}

H_{0}: F_{x} = F\\ H_{1}: F_{x} \neq F

\end{cases}

\end{equation}
```


In the goodness of fit hypothesis test, only one sample
$\underline{x} = \{ x_{i1} \dots x_{iM} \}_{i =1}^{N}$ is available and the purpose is to
test if this sample came from the reference distribution $F$. The
distributions are still multivariate and we can use a classifier to
reduce multivariate goodness of fit test into a univariate one. The
hypothesis system change as the following:

```{=tex}
\begin{equation}

\begin{cases}

H_{0}: F_{s_{x}} = F_{s}\\ H_{1}: F_{s_{x}} \neq F_{s}

\end{cases}

\end{equation}
```

To perform the above test it is still possible to use the Kolmogorov-Smirnov test. In order to get
out the distribution of the Kolmogorov statistics under the null
the hypothesis we can set up a Monte Carlo simulation as the following:

<div>

<ol>

<li>Drawn a sample $z_{i}$ of size $n_{1}$ from $p_{0}$.</li>

<li> Use them with the actual data to train the classification model and
compute the scores.</li>

<li>Compute the statistics test between the two scores sample.</li>

<li>Repeat P times.</li>

</ol>

</dic>

After this procedure, we obtain a set of statistics test
$\underline{t} = {t_{1} \dots t_{n}}$ that can be used as the
distribution of statistic test under the null hypothesis. So we can
build the reject region $R_{\alpha}$ as the values of the test statistic
greater than a threshold equal $1 - \alpha$ quantile of the
distribution. We reject with significance level $\alpha$ if the observed
statistics are greater than the threshold.

We can compute the p-value
$p = \frac{1}{N} \sum_{j=1}^N \mathbb{I}(t_j \geqslant \hat{t}) = \{\text{proportion of permuted statistics larger than the original}\}$
setting the decision rule to reject the null hypothesis if the
value of $p$ is lower than $\alpha$.


```{r Used functions , warning=FALSE}
# Take random sample from a multi-normal distribution.
Take_sample_normal <- function(n, mu, sigma, label){
  x <- mvrnorm(n, mu, sigma)    # Random generate from a multivariate normal
  x <- as.data.frame(cbind(x, label))   # add label
  colnames(x[length(colnames(x))]) <- "label"   # col label

  return(x)
}

# Apply Friedman procedure.
Friedman_procedure <- function(P,x_data , z_data , permut = FALSE){
  x_fri <- x_data       # Copy the data
  z_p <- z_data         # copy the z data
  kolm_t <- rep(NA, P)  # Pre-set the Kolmogorov-Smirnov statistic
  labels <- c(rep(0,n0),rep(1,n1))   # Set of the label
  for(i in 1:P){                     # Loop
    print(i)
    if(permut == F){                 # Check goodnees of fit
      z_p <- Take_sample_normal(n1, mu, sigma, label =1)  # Sample under the null F
      }
    if(permut == T){                 # Two sample test
    idx <- sample(x = 1:(n0+n1), n0+n1)    # Shuffle the label

    x_fri$label <- labels[idx[1:n0]]            # Permuted label
    z_p$label <- labels[idx[(n0+1):(n0+n1)]]
    }   # Permuted label
    u_p <- as.data.frame(rbind(x_fri,z_p))      # Row bind the two data frame
    glm_f <- glm(label ~ . , data = u_p )
    # Train the model
    scores <- predict(glm_f ,u_p[,1:k], s = 0.05)         # Compute the scores
    u_p <- as.data.frame(u_p)
    kolm_t[i] <- ks.test(scores[u_p$label == 0] , scores[u_p$label == 1])$statistic
  }
  return(kolm_t)

}

# Get information about alpha.
alpha_info <- function(P , permut = FALSE){
  prop_rej <- rep(NA, P)    # Pre-set accept/reject values
  p_values <- rep(NA , P)   # Preset pvalues

  for(i in 1:P){            # Loop over P
    x_p <- Take_sample_normal(n = n0, mu = mu, sigma = sigma, label = 0)  # Take sample of
    z_p <- Take_sample_normal(n = n1, mu = mu, sigma = sigma, label = 1)  # same distributions
    u_p<- rbind(x_p, z_p) # Combine the data
    if(permut == FALSE){   # Goodness of fit test
      kk <- Friedman_procedure(P, x_data = x_p , z_data = z_p)}
    if(permut == TRUE){    # Two sample test
      kk <- Friedman_procedure(P, x_data = x_p , z_data = z_p , permut = T)
    }

    glm_model <-glm(label ~ ., data = u_p)   # Perform the glm observed model

    x_scores <- predict(glm_model , x_p[,1:k])   # Compute the obs score
    z_scores <- predict(glm_model,  z_p[,1:k])   # ""

    true_kolm <- ks.test(x_scores, z_scores, alternative = "two.sided")$statistic  # observed statistic
    prop_rej[i] <- true_kolm < quantile(kk , 1 -alpha)   # check if accept or no
    p_values[i] <- sum(kk > true_kolm ) / length(kk)     # Store the p-values
  }

  data = as.data.frame(cbind(prop_rej,p_values))        # combine the dataframe
  colnames(data) <- c("KS","p_values")                  # Change the names
  return(data)
}

```

```{r parameters}
k <- 5    # Dimensions
n0 <- 70  # Sample size 0
n1 <- 70  # Sample size 1
P <- 100      # Simulation size
alpha <- .05  # significance level
set.seed(324) # reproducibility

mu <- rep(0,k)    # Mean <- c(0,0,0,0,0 ...)
sigma <- diag(1 ,k)  # Identity matrix for sigma
```


```{r alphas , fig.height = 4, fig.width = 9, fig.align = "center", echo = FALSE}

palette <- c("#FF6F59", "#FFFD98", "#BDE4A7", "#9FBBCC", "#00487C" )


par(mfrow=c(1,2), col='white', col.axis='white', col.lab='#B8BCC2', col.main='#B8BCC2', fg='white', bg="#2B3E50")
load("data/alpha_info_1.RData")
t <- proportions(table(data[,1]))
barplot(t, col =c("#eb4934", '#9ee8e8') , main = TeX(r"(Proportion of times we accept / reject $H_{0}$ when is True)" , bold = T) ,cex.main = .9, names.arg = c("Reject" , "Accept"), ylim = c(0,1) , las = 1, border = "#2B3E50")


hist(data[,2],freq = F , breaks = 6 , border = "#2B3E50",
      col = 'khaki1', main = TeX(r"(p-value distribution under $H_{0}$ )", bold = T), xlab = TeX(r"(\hat{p}-value)"), las=1)
points(seq(0,1,length.out = 300) , rep(1,300) , col = 'white' , type='l', lwd=2)

legend("topright" , legend = "Uniform\ndistribution" , border = "white" , col = "white" , bty = "n", lwd = 3,cex=.8)

```


#### Comments

In the simulation above we are in a scenario where $n_{0} = n{1} = 70$,
the dimensionality $k = 5$, and the simulation size $P = 100$ and we are
dealing with the goodness of fit hypothesis test. The first plot shows the empirical size
$\alpha \approx 0.05$ and the second one show the distribution of the
observed $\hat{p}$ values with the theoretical $Unif(0,1)$ distribution.
Increasing $P$ the histograms would fit better with the theoretical distribution.

The real goal of this type of simulation is to see how an algorithm
would perform w.r.t to metrics as size and power while changing input
parameters as $n_{0}, n_{1}, k$.

#### Performance with rispect to the size $\alpha$

```{r changing alpha, changing n0, warning=FALSE , eval = FALSE}
# Changing n0.
n0s <- c(20,50,70,100)   # Values of alpha tested
k <- 5                   # Set the dimensionality as 5
P <- 150
alpha_Friedmann <- rep(NA ,length(n0s))    # Preset the empirical alpha
alpha_per <- rep(NA , length(n0s))         # Preset the empirical alpha
for(i in 1:length(n0s)){
  n0 <- n0s[i]
  alpha_Friedmann[i] <- 1-mean(alpha_info(P)$KS)
  alpha_per[i] <- 1-mean(alpha_info(P,permut = T)$KS)
}

# Changing k.
ks <- c(5,20,70,100)
P <- 150
n0 <- 70
n1 <- 70
alpha_permutation_2 <- rep(NA ,length(ks))
alpha_Friedmann_2 <- rep(NA ,length(ks))
for(i in 1:length(ks)){
  print(i)
  k <- ks[i]
  mu <- rep(0,k)
  sigma <- diag(1,k)
  alpha_permutation_2[i] <- 1-mean(alpha_info(P)$KS)
  alpha_Friedmann_2[i] <- 1-mean(alpha_info(P,permut = T)$KS)
  }

```

```{r plots change size ,fig.height = 4, fig.width = 9, fig.align = "center",echo = FALSE}
par(mfrow=c(1,2), col='white', col.axis='white', col.lab='#B8BCC2', col.main='#B8BCC2', fg='white', bg="#2B3E50")

load("data/alpha_fr_change_n.RData")
load("data/alpha_per_change_n.RData")
n0s <- c(20,50,70,100)   # Values of alpha tested

{plot(n0s ,alpha_Friedmann , type = "l" , ylim = c(0,0.2) , lwd = 2 , ylab = TeX(r"($\alpha$)")  ,xlab = TeX(r"(sample size $n_{0}$)") , col = "#f2f2f2" , main = TeX(r"($alpha$ vs $n_{0}$ with $k = 5$ and $n_{1} = 70$)", bold = T), las=1)
points(n0s , alpha_per , type = "l" , col = "gold2" , lwd = 2)

legend("topright" , legend = c("Goodness of fit test", "Two sample test") ,
       col = c("#f2f2f2","gold2") , border = "white" ,
       lty = 1 , bty = "n",
       cex =.9, lwd = 3)
grid()}


load("data/alpha_per_change_k.RData")
load("data/alpha_fr_change_k.RData")
ks <- c(5,20,70,100)

{plot(alpha_fr_change_k, type = "l" , ylim = c(0,0.2) , lwd = 2 , ylab = TeX(r"($\alpha$)")  ,xlab = TeX(r"(dimensionality $k$)") , col = "#f2f2f2" , main = TeX(r"($alpha$ vs $k$ with $n_{0} = 70$ and $n_{1} = 70$)", bold = T), las=1)
points(alpha_per_change_k, type = "l" , col = "gold2" , lwd = 2)

legend("topright" , legend = c("Goodness of fit test", "Two sample test") ,
       col = c("#f2f2f2","gold2") , border = "white" ,
       lty = 1 , bty = "n",
       cex =.9, lwd = 3)
grid()}

```

####  Comments

Since the test is built to control the type I error $\alpha$  it should be noted that the empirical result is approximately close to the desired control $0.05$, especially when the amount of data grows. As the dimensionality grows plot shows that the test has more difficult controlling the type I error. When the dimensionality is high, the performance of the test depends a lot on the machine learning algorithm used. More flexible algorithms would perform better than complex ones.


### Performance with rispect to the power $1 - \beta$

To get out information about the power of the test (probability to get a
correct discovery) we set up a simulation generating sample from
difference distribution to check how likely it is that our test rejects
$H_{0}$ when it is false. We set up different scenarios where
$F_{x} \neq F_{z}$ and count how many
times we reject the null hypothesis. Of course, if the two distributions
are close to each other is more difficult to discover the difference and
reject $H_{0}$ then. If the two distributions are far to each other, it would be easier for the test detect the differences and reject $H_{0}$ then.

In the simulation study we choose Gaussian parametric family for both
$F_{x}$ and $F_{z}$, let change the  $F_{z}$  position parameters
$\underline{\mu}$ since it possible to compute a closed-form of Wasserstein distance of two distributions for two multivariate normal distributions $D(F_{x},F_{z}) = D(\underline{\mu_{x}} ,\underline{\mu_{z}})$
. $$
\boldsymbol{X} \sim \mathrm{N}_k(\boldsymbol{\mu}_1, \Sigma_1)
\hspace{.3cm}
\boldsymbol{Y} \sim \mathrm{N}_k(\boldsymbol{\mu}_2, \Sigma_2)
 $$

 $$
\textsf{W}^2_2(\boldsymbol{X}, \boldsymbol{Y}) = \| \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_2^2 + \mathrm{B}^2(\Sigma_1, \Sigma_2)
 $$

```{r distance and power info , warning = FALSE}

# Distance functions.
beta_q <- function(sigma1, sigma2){
  trace_1 <- tr(sigma)    # Trace of sigma X
  trace_2 <- tr(sigma2)   # Trace of sigma Y
  trace_3 <- tr(sqrtm(sqrtm(sigma1)$B %*% sigma2 %*% sqrtm(sigma1)$B)$B)

  return(trace_1 + trace_2 - 2*trace_3)
}

Wasserstein_distance <- function(mu1, mu2, sigma1, sigma2){
  norma <- norm(mu1-mu2, type = "2")**2
  beta_quadro <- beta_q(sigma1, sigma2)

  return(norma + beta_quadro)
}

power_info <- function(P, k, increment , permut = F){

  prop_ks <- rep(NA, P)    # Pre-set proportions of accept/reject
  for(i in 1:P){
    x <- Take_sample_normal(n = n0, mu= mu, sigma = sigma, label = 0)  # Take sample of x
    z <- Take_sample_normal(n = n1, mu=  mu2, sigma = sigma2, label = 1) # Take sample of z (From a different distribution)
    u <- rbind(x,z)    # Combine the dataset
    p_model <- glm(label ~ ., data = u)   # Perform the obs model
    x_scores <- predict(p_model, x)       # Predict the obs scores
    z_scores <- predict(p_model, z)       # ""
    kolm_obs <- ks.test(x_scores, z_scores, alternative = "two.sided")$statistic                    # Obs statistic
    if(permut == FALSE){
      kk <- Friedman_procedure(P,x_data = x, z_data = z)
    }
    if(permut == TRUE){
      kk <- Friedman_procedure(P,x_data = x, z_data = z , permut = T)
    }

    prop_ks[i] <- kolm_obs > quantile(kk,1-alpha)
  }
  data <-  mean(prop_ks)


  return(data)
}


```


```{r power info changing distance , warning = FALSE , eval = FALSE, echo=FALSE}
max_in <- .9
P <- 100
k <- 5
n0 <- 30
n1 <- 20
mu <- rep(0,k)
sigma <- diag(1,k)
ll <- seq(from = 0 , to = max_in , by = 0.1)
powers <- c()
powers_2 <- c()
distance <- c()
for (i in (ll)){
  mu2 <- mu + i
  print(mu2)
  sigma2 <- sigma
  distance <- c(distance,Wasserstein_distance(mu, mu2, sigma, sigma2))
  powers <- c(powers,power_info(P, k = length(mu),increment = i))
  powers_2 <- c(powers_2 ,power_info(P, k = length(mu),increment = i , permut = T))
}


```


```{r plot power info twosamplegof , fig.height = 4, fig.width = 5, fig.align = "center" ,echo = FALSE}

load("data/power_GoF.RData")
load("data/power_two_sample_test.RData")
load("data/distance_Gof_two.RData")


par(col='#dedbd9', col.axis='white', col.lab='#B8BCC2', col.main='#B8BCC2', fg='white', bg="#2B3E50")

plot(distance,powers, main = TeX("Power vs Distance $W_{2}^{2}(X,Z)$" , bold = T), xlab = "distance", ylab = "power",
     col = '#f2f2f2' , type = "b" ,pch = 16, lwd = 2, las=1)
points(distance, powers_2 , col = "gold2" , type = "b" ,pch=16, lwd = 2)
legend("bottomright" , legend = c("GoF", "Two-sample") , col = c("#f2f2f2","gold2"),border = "white" ,
       lty = 1 , bty = "n",
       cex =.8, lwd = 3, pch=1)
grid()

```


##### Comments

In the Goodness of fit hypothesis testing we use the additional
information of the distribution of $F_{z} = F$, such of that, this
procedure has the potential to increase the power of the test. However,
in the real application when our knowledge is as limited knowing only the sample and, so to perform goodness of fit test we need to assume a reference distribution for the sample. This can be done with using past knowledge and studies.
In our case, we will remain in the context of two sample hypothesis test.


#### Test power vs sample size and test power vs dimensionality

The test's usefulness is based on its ability to distinguish among a diverse range of alternative hypotheses. In this section, we present the outcomes of different simulation studies that investigate the test's power for various alternative hypotheses across multiple dimensions.

We focus our studies on the context of two sample hypothesis test.
We set in different scenarios, where based on our knowledge simulation, the sample $\underline{x}$ came from a $F_{1} \sim N(\underline{\mu_{1}},\Sigma)$ and the sample $\underline{z}$ came from a $F_{2} \sim N(\underline{\mu_{2}},\Sigma)$ with different $\underline{\mu_{2}}$. Greater the distance from  $\underline{\mu_{1}}$ and $\underline{\mu_{2}}$, the more separated will be the two distributions we are comparing, and so in all likelihood, the test can detect the differences better.
Specifically, we tweak the parameter $n_{0}$ and $k$ to get information about how the power varies with the distance and with the parameter values.

The plot on the left show how the power change according to the distance, let varying the sample size $n_{0}$.

The plot on the right show how the power change according to the distance, let varying the dimensionality $k$.

```{r plot distance power changing n0 and k, fig.height = 4, fig.width = 9, fig.align = "center" ,echo = FALSE}

par(mfrow=c(1,2), col='white', col.axis='white', col.lab='#B8BCC2', col.main='#B8BCC2', fg='white', bg="#2B3E50")
load('data/distance_power_n0.RData')

colors <- c('white', 'gold2', 'green', 'coral1')


plot(1,1, type='n', xlim=c(0,3), ylim=c(0,1), xlab='distance',ylab='power', las=1, main = TeX('Power vs distance with $n_1=70$ & $k=20$', bold=T))
grid(nx=NA, col="#f2f2f2", ny=NULL)
for(i in 1:length(distance_power_n0)){
  points(distance_power_n0[[i]], type='l', col=colors[5-i], lwd=2)
  }
legend('bottomright', legend=c(20,50,70,100), lwd=3, col = colors[length(colors):1], bty='n', title = TeX('$n_0$'))


load('data/distance_power_k.RData')

plot(1,1, type='n', xlim=c(0,4), ylim=c(0,1), xlab='distance',ylab='power', las=1, main=TeX('Power vs distance with $n_0=70$ & $n_1=60$', bold=T))
grid(nx=NA, col="#f2f2f2", ny=NULL)
for(i in 1:4){
  points(distance_power_k[[i]], type='l', col=colors[i], lwd=2)
  }
legend('bottomright', legend=c(5, 20, 50, 100), lwd=3, col = colors, bty='n',title = 'k')

```

##### Comment

In both plots, the greater the distance, the greater the power will be.
Increasing the values of the sample size $n_{0}$, would increase the power of the test. Adding information would give more possibilities to the test in order to detect the difference between the two distributions and the performance of the test both on the side and the power would be better.
Increasing the dimensionality the test has low power due the curse of dimensionality, with the small amount of data, the higher the dimension, the thinner the information effectively available to resolve the difference would be and without a huge amount of data, it would return a lower power of the test. This is highlighted when $k = 100$, the power results too low or even zero and it is needed to have a huge amount of data to get better performance.

## **Exercise 4**

```{r load dataframe asd td, echo = FALSE}
load('hw3_data.RData')
```


### Dataset

The dataset we used for the sake of this project is ABIDE (Autism Brain
Imaging Data Exchange); a collection of brain imaging data and clinical
information from individuals with autism spectrum disorder (ASD) and
typically developing (TD) controls.

The dataset consists of structural and functional fMRI data from over
85 individuals with ASD and 93 TD controls.

Since data were taken in different labs, it's worth removing the
lab-specific effect normalizing over all the different labs. So we scale the patient's values with the mean and the standard deviation of all data of the laboratory that takes the datas of the patient's brain. In this way we can compare subjects coming from different laboratories.

As the previous study suggests, to try to separate the two groups (ASD vs TD) we will by digging (linear) dependencies of cortical regions. Such that we will study the time series of the autocorrelation instead of the original ones.
Autocorrelation is defined as : $AC(t_1,t_2) = Corr(X_{t_1} ,X_{t_2})$.


```{r scale dataset by lab, eval=FALSE}
# Put all data together.
all_data <- c()
for(el in 1:length(asd_data))  all_data <- c(all_data, asd_data[el])
for(el in 1:length(td_data))  all_data <- c(all_data, td_data[el])

# Take unique names of labs.
lab <- names(all_data)
init_lab <- c()
for(el in lab) init_lab <- c(init_lab, substr(el,1,2))
u_init_lab <- unique(init_lab)

# Scale every data frame by lab.
for(l in u_init_lab){
  pos <- which(grepl(l, init_lab))
  d <- matrix(NA, 0, 116)
  for(i in pos) d <- rbind(d, all_data[[i]])
  d <- as.matrix(d)
  m <- mean(d)
  s <- sd(d)
  for(i in pos) all_data[[i]] <- (all_data[[i]] - m) / s
}

# Get new scaled data frames.
asd_data_lab_scale <- all_data[1:85]
td_data_lab_scale <- all_data[86:178]


# Function to get summaries from all patient's ROIs.
list_summary <- function(list, fun){
  len <- length(list)
  tab <- matrix(NA, nrow = len, ncol = 116)

  for(i in 1:len){
    tab[i,] <- unname(sapply(list[[i]], fun, na.rm = T))  }

  return(tab)
}


# Get the time series of the autocorrelations for each ROI of each patient.
td_corr <- td_data_lab_scale
for(p in 1:length(td_corr)){
  patient <- td_corr[[p]]
  for(i in 1:ncol(patient)){
    auto_corr <- acf(td_corr[[p]][,i], plot=F, type = "correlation", lag = length(td_corr[[p]][,i]))$acf[,,1]
    # Handle cases of ROIS with all the same values.
    if(NaN %in% auto_corr) auto_corr <- rep(1, length(td_corr[[p]][,i]))
    td_corr[[p]][,i] <- auto_corr
  }
}

asd_corr <- asd_data_lab_scale
for(p in 1:length(asd_corr)){
  patient <- asd_corr[[p]]
  for(i in 1:ncol(patient)){
    auto_corr <- acf(asd_corr[[p]][,i], plot=F, type = "correlation", lag = length(asd_corr[[p]][,i]))$acf[,,1]
    if(NaN %in% auto_corr) auto_corr <- rep(1, length(asd_corr[[p]][,i]))
    asd_corr[[p]][,i] <- auto_corr
  }
}

```

For each time series we extract three features: *Mean*, *Median* and *Standard Deviation*.


```{r summaries of data, eval=FALSE}


# Get new data frames with summaries of all the patients for ASD and TD.
td_ROI_mean <- list_summary(td_corr, mean)
td_ROI_median <- list_summary(td_corr, median)
td_ROI_sd <- list_summary(td_corr, sd)
x_data <- c()
x_data <- as.data.frame(cbind(td_ROI_mean, td_ROI_median, td_ROI_sd, rep(0, length(td_data))))
names(x_data)[ncol(x_data)] <- 'label'


asd_ROI_mean <- list_summary(asd_corr, mean)
asd_ROI_median <- list_summary(asd_corr, median)
asd_ROI_sd <- list_summary(asd_corr, sd)
z_data <- c()
z_data <- as.data.frame(cbind(asd_ROI_mean, asd_ROI_median, asd_ROI_sd, rep(1, length(asd_data))))
names(z_data)[ncol(z_data)] <- 'label'

```

Since the data we collect are a finite sample ($n_{0} + n_{1} = 93 + 85 = 178$) with high dimensionality $k = 116 \times 3 = 348$ a vanilla logistic regression would not perform goog- Such that we need to regularize setting an early stopping and a $l_{2}$ regularization.

To perform the test in both cases, when the null hypothesis $H_{0}$ (the distributions of the scores for the groups are the same) is true and when is false, we split the td dataset in two in order to check the result of the test.

#### Perform the test with two groups of TD subjects respectively with the labels 0 and 1

```{r Friedman regularized, eval = FALSE}
Friedman_procedure <- function(P,x_data , z_data , permut = FALSE){
  x_fri <- x_data       # Copy the data
  z_p <- z_data         # copy the z data
  kolm_t <- rep(NA, P)  # Pre-set the Kolmogorov-Smirnov statistic
  labels <- c(rep(0,n0),rep(1,n1))   # Set of the label
  for(i in 1:P){                     # Loop
    if(permut == F){                 # Check goodnees of fit
      z_p <- Take_sample_normal(n1, mu, sigma, label =1)  # Sample under the null F
    }
    if(permut == T){                 # Two sample test
      idx <- sample(x = 1:(n0+n1), n0+n1)    # Shuffle the label

      x_fri$label <- labels[idx[1:n0]]            # Permuted label
      z_p$label <- labels[idx[(n0+1):(n0+n1)]]
    }   # Permuted label
    u_p <- as.matrix(rbind(x_fri,z_p))      # Row bind the two data frame
    glm_f <-glmnet(
      x = u_p[,1:(k-1)],
      y = u_p[,k],
      family = "gaussian",
      alpha = 0.2,
      maxit = 100
    )

    # Train the model
    scores <- predict(glm_f ,u_p[,1:(k-1)], s = 0.05)         # Compute the scores
    u_p <- as.data.frame(u_p)
    kolm_t[i] <- ks.test(scores[u_p$label == 0] , scores[u_p$label == 1])$statistic
  }
  return(kolm_t)

}
```


```{r performing the Friedmann procedure same dataset, warning = FALSE,eval=FALSE}
u <- as.matrix(rbind(x_data[1:50,],z_data))
k <- dim(u)[2]
n0 <- 50
n1 <- 85


my_model <- glmnet(
  x = u[,1:(k-1)],
  y = u[,k],
  family = "gaussian",
  alpha = 0.2,
  maxit = 100
)

sc <- predict(my_model, u[,1:(k-1)] , s = 0.05)

aa <- Friedman_procedure(100,x_data = x_data[1:50,], z_data = z_data, permut = T )
k_obs_1 <- ks.test(sc[1:50] , sc[51:135])$statistic

k_obs_1 < quantile(aa, 0.95)

```


#### Same group (TD subjects)
```{r plot score_same, echo=FALSE , warning= FALSE, fig.width=10,fig.height=5.5}
load("data/score_same.RData")
load("data/k_obs_same.RData")
load("data/test_4_same.RData")

# Build dataset with different distributions
par(mfrow = c(1,2), col='white', col.axis='white', col.lab='white', col.main='white', fg='white', bg="#2B3E50")

first <- density(score_same[1:50])
second <- density(score_same[51:93])

plot(first, xlim=c(0,1), xlab='scores', lwd=.1, las=1, main= "Distribution of scores for two group of TD subjects")
lines(second, lwd=.1)

dens_col <- c(rgb(0.2274510, 0.9411765, 0.6078431,.4),rgb(0.9882353, 0.9882353, 0.9882353, .4) )

polygon(first, col=dens_col[1])
polygon(second, col=dens_col[2])
legend('topright', legend=c('label 0', 'label 1'), pch = 15, col=dens_col, cex=1.2, bty='n')
grid(lty=2, lwd=.6)


hist(bb, main = TeX("Distribution of KS statistic under $H_0$", bold=T),
     col = "khaki1" , border = "#2B3E50" , freq = F , xlab = "ks statistic", xlim = c(0,1), las=1, ylim=c(0,8))
abline(v = quantile(bb, 0.95), col = "red", lwd = 2 , lty = 2)
abline(v = k2 , col = "steelblue3", lwd = 2, lty = 2)
legend("topleft" , legend = c("Threshold" , "Observed\n  statistic") , lty = 2 ,
       bty = "n", col = c("red","steelblue3"), lwd = 2, cex=1.2)

```

##### Comment

The first plot shows the two distributions of the Kolmogorov-Smirnov scores for both samples. The second one shows the distribution of the statistics under the null hypothesis. The red line is the quantile at level $1- \alpha = q_{\alpha}$  of the distribution, and the blue line is the observed value of the statistic we get with our data.

Since the decision rule is to reject the null hypothesis if the observed value of Kolmogorov Smirnov statistics is greater than $q_{\alpha}$, we reject the null hypothesis. Data don't show enough evidence in order to reject the null hypothesis, we claim that the distribution of the scores is the same

#### Perform the test with  TD and ASD subjects respectively with the labels 0 and 1


```{r performing the Friedmann procedure different dataset, warning = FALSE, eval=FALSE}
 dim(x_data)
u_2 <- as.matrix(x_data)
label <- c(rep(0, 50), rep(1, 43))

n0 <- 50
n1 <- 43

model_2 <- glmnet(
  x = u_2[, 1:(k - 1)],
  y = label,
  family = "gaussian",
  alpha = 0.1,
  maxit = 100
)
sc_2 <- predict(model_2, u_2[, 1:(k - 1)] , s = 0.7)
k2 <- ks.test(sc_2[1:50] , sc_2[51:93])$statistic

bb <-
  Friedman_procedure(
    P = 100,
    x_data = x_data[1:50, ],
    z_data = x_data[51:93, ],
    permut = T
  )
k2 < quantile(bb, 0.95)

```


```{r plot scores different , warning=FALSE , echo= FALSE, fig.width=10,fig.height=5.5}
load("data/score_different.RData")
load("data/test_4_different.RData")
load("data/k_obs_different.RData")

par(mfrow = c(1,2), col='white', col.axis='white', col.lab='white', col.main='white', fg='white', bg="#2B3E50")


first <- density(score_different[1:50])
second <- density(score_different[51:93])

plot(second, xlim=c(0,1), xlab='scores', lwd=.1, las=1, main='Distribution of scores for TD and ASD subjects')
lines(first, lwd=.1)
legend('topleft', legend=c('TD', 'ASD'), pch = 15, col=dens_col, cex=1.2, bty='n')
polygon(first, col=dens_col[1])
polygon(second, col=dens_col[2])
grid(lty=2, lwd=.6)


hist(aa, main = TeX("Distribution of KS statistic under $H_0$", bold=T),
     col = "khaki1" , border = "#2B3E50" , freq = F , xlab = "ks statistic", xlim = c(0,1), ylim=c(0,8), las=1)
abline(v = quantile(aa, .95), col = "red", lwd = 2 , lty = 2)
abline(v = k_obs_1 , col = "steelblue3", lwd = 2, lty = 2)
legend("topleft" , legend = c("Threshold" , "Observed \n  statistic") , lty = 2 ,
       bty = "n", col = c("red","steelblue3"), lwd = 2, cex=1.2)

```

##### Comment

The first plot shows the two distributions of the Kolmogorov-Smirnov scores for both samples. The second one shows the distribution of the statistics under the null hypothesis. The red line is the quantile at level $1- \alpha = q_{\alpha}$  of the distribution, and the blue line is the observed value of the statistic we get with our data.

Since the decision rule is to reject the null hypothesis if the observed value of Kolmogorov Smirnov statistics is greater than $q_{\alpha}$, we reject the null hypothesis. Data show enough evidence in order to reject the null hypothesis, we claim that the distribution of the scores of asd and td subjects are different.

#### Conclusion and final remark

In the case of interest we are dealing with a noisy and small set of data, and the higher the dimension the lower the information effectively available to
resolve the difference and without a large enough amount of data, we fall into the curse of dimensionality problem, so to have a better and more consinstent performance of the test is worth to collect more data.