\documentclass[a4paper]{article}
%\documentclass{exam}
\usepackage{chemformula}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{a4wide,amssymb,epsfig,latexsym,multicol,array,hhline,fancyhdr}
\usepackage{vntex}
\usepackage[english]{babel}
\usepackage{inputenc}
\usepackage{amsmath}
\usepackage{physics}
\usepackage{lastpage}
\usepackage[lined,boxed,commentsnumbered]{algorithm2e}
\usepackage[backend=biber, style=numeric, sorting=ynt]{biblatex}
\addbibresource{references.bib}
\usepackage{enumerate}
\usepackage{color}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{graphicx}							% Standard graphics package
\usepackage{tabularx, caption}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{rotating}
\usepackage{graphics}
\usepackage{geometry}
\usepackage{setspace}
\usepackage{epsfig}
\usepackage{tikz}
\usepackage{float}
\usepackage{longtable}
\usetikzlibrary{arrows,snakes,backgrounds}
\usepackage{hyperref}
\hypersetup{urlcolor=blue,linkcolor=black,citecolor=red,colorlinks=true,breaklinks=true}
%\usepackage{pstcol} 								% PSTricks with the standard color package

\counterwithin*{equation}{section}
\counterwithin*{equation}{subsection}


\newtheorem{theorem}{{\bf Theorem}}
\newtheorem{property}{{\bf Property}}
\newtheorem{proposition}{{\bf Proposition}}
\newtheorem{corollary}[proposition]{{\bf Corollary}}
\newtheorem{lemma}[proposition]{{\bf Lemma}}


\AtBeginDocument{\renewcommand*\contentsname{Contents}}
\AtBeginDocument{\renewcommand*\refname{References}}
%\usepackage{fancyhdr}
\setlength{\headheight}{40pt}
\pagestyle{fancy}
\fancyhead{} % clear all header fields
\fancyhead[L]{
 \begin{tabular}{rl}
    \begin{picture}(25,15)(0,0)
    \put(0,-8){\includegraphics[width=8mm, height=8mm]{Picture/hcmut.png}}
    %\put(0,-8){\epsfig{width=10mm,figure=hcmut.eps}}
   \end{picture}&
	%\includegraphics[width=8mm, height=8mm]{hcmut.png} & %
	\begin{tabular}{l}
		\textbf{\bf \ttfamily Ho Chi Minh City University of Technology}\\
		\textbf{\bf \ttfamily Faculty of Computer Science and Engineering}
	\end{tabular}
 \end{tabular}
}
\fancyhead[R]{
	\begin{tabular}{l}
		\tiny \bf \\
		\tiny \bf
	\end{tabular}  }
\fancyfoot{} % clear all footer fields
\fancyfoot[L]{\scriptsize \ttfamily Probability and Statistics Assignment's Report, Semester 212}
\fancyfoot[R]{\scriptsize \ttfamily Page {\thepage}/\pageref{LastPage}}
\renewcommand{\headrulewidth}{0.3pt}
\renewcommand{\footrulewidth}{0.3pt}


%%%
\setcounter{secnumdepth}{4}
\setcounter{tocdepth}{3}
\makeatletter
\newcounter {subsubsubsection}[subsubsection]
\renewcommand\thesubsubsubsection{\thesubsubsection .\@alph\c@subsubsubsection}
\newcommand\subsubsubsection{\@startsection{subsubsubsection}{4}{\z@}%
                                     {-3.25ex\@plus -1ex \@minus -.2ex}%
                                     {1.5ex \@plus .2ex}%
                                     {\normalfont\normalsize\bfseries}}
\newcommand*\l@subsubsubsection{\@dottedtocline{3}{10.0em}{4.1em}}
\newcommand*{\subsubsubsectionmark}[1]{}
\makeatother

\newcommand{\alert}[1]{\textcolor{blue}{#1}}
\newcommand\afterclassquestion{\renewcommand\questionlabel{\thequestion.\makebox[0pt]{$^\ast$}}}
\newcommand\standardquestion{\renewcommand\questionlabel{\thequestion.}}

\lstset{language=R,
    basicstyle=\small\ttfamily,
    stringstyle=\color{green},
    otherkeywords={0,1,2,3,4,5,6,7,8,9},
    morekeywords={TRUE,FALSE},
    deletekeywords={data,frame,length,as,character},
    keywordstyle=\color{blue},
    commentstyle=\color{green},
}


\begin{document}

\begin{titlepage}
\begin{center}
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY \\
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
\end{center}

\vspace{1cm}

\begin{figure}[h!]
\begin{center}
\includegraphics[width=3cm]{Picture/hcmut.png}
\end{center}
\end{figure}

\vspace{1cm}


\begin{center}
\textbf{\Huge Probability \& Statistics - MT2013}    \\
\begin{tabular}{c}
\hline
\\
\begin{tabular}{c}
    \textbf{\Huge Assignment's Report} \\
    {}                               \\
    \textbf{\large Analysis of the Relationships between Various CPU Specifications}\\
    {}                          \\
    \textbf{\large with Multi-factor ANOVA Test}\\
    {}              \\
    \textbf{\large and Multiple Linear Regression Models using R}\\
    {} \\
\end{tabular}
\\
\hline
\end{tabular}
\end{center}

\vspace{2.5cm}

\begin{table}[h]
\begin{tabular}{rrl}
\Large
\hspace{5 cm} & Lecturer: & Nguyễn Tiến Dũng\\
& Class: & CC03 \\
& Group: & 03 \\
& Students: & Cao Minh Quang - 2052221\\
& & Trần Cao Duy Trường - 2052299 \\
& & Lâm Quang Khải - 2052128 \\
& & Hoàng Cao Quốc Thắng - 2050020 \\
& & Trương Huỳnh Đăng Khoa - 2053145\\
\end{tabular}
\end{table}

\begin{center}
{\footnotesize HO CHI MINH CITY, May 2022}
\end{center}
\end{titlepage}

\newpage
\tableofcontents
\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Member list and Workload}
\begin{table}[H]
\large
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
\multicolumn{1}{|c|}{\textbf{No.}} & \multicolumn{1}{c|}{\textbf{Full name}} & \multicolumn{1}{c|}{\textbf{Student ID}} & \multicolumn{1}{c|}{\textbf{Task}} & \multicolumn{1}{c|}{\textbf{Contribution}}\\
\hline

%%%%%Student 1%%%%%%%%%%
\multirow{2}{*}{1} &
\multirow{2}{*}{Cao Minh Quang} &
\multirow{2}{*}{2052221} & Tasks 6 + 7.1 &
\multirow{2}{*}{25\%}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & + 7.5 + 8.1 + 8.2 &
\multirow{2}{*}{}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & + 8.3.1 + 8.8 &
\multirow{2}{*}{}\\
\hline

%%%%% Student 2 %%%%%%%%%%%
\multirow{2}{*}{2} &
\multirow{2}{*}{Trần Cao Duy Trường} &
\multirow{2}{*}{2052299} & Tasks 3 + 4 &
\multirow{2}{*}{17\%}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & + 7.2 + 8.7 &
\multirow{2}{*}{}\\
\hline

%%%%% Student 3 %%%%%%%%%%%
\multirow{2}{*}{3} &
\multirow{2}{*}{Lâm Quang Khải} &
\multirow{2}{*}{2052128} & Tasks 2 + 4 &
\multirow{2}{*}{21\%}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & + 5.3.1 + 7.1  &
\multirow{2}{*}{}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & +  7.4 + 8.5 &
\multirow{2}{*}{}\\
\hline
%%%%% Student 4 %%%%%%%%%%%
\multirow{2}{*}{4} &
\multirow{2}{*}{Hoàng Cao Quốc Thắng} &
\multirow{2}{*}{2050020} & Tasks 5.1 + 5.2 &
\multirow{2}{*}{18\%}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & + 5.3.2 + 7.3 + 8.4 &
\multirow{2}{*}{}\\
\hline
%%%%% Student 5 %%%%%%%%%%%
\multirow{2}{*}{5} &
\multirow{2}{*}{Trương Huỳnh Đăng Khoa} &
\multirow{2}{*}{2053145} & Tasks 2 + 5.3.3 &
\multirow{2}{*}{19\%}\\

\multirow{2}{*}{} &
\multirow{2}{*}{} &
\multirow{2}{*}{} & + 7.4 + 8.3.2 + 8.6 &
\multirow{2}{*}{}\\
\hline

\end{tabular}
\end{table}


\newpage
%%%%%% CONTENT %%%%%%%%
\section{Introduction}
\begin{itemize}

    \item[]Our group uses a Kaggle dataset containing a collection of Intel CPUs released from 2010 through 2021 for our data analysis project. By analyzing this dataset, we hope to demonstrate the impact of cores, threads, bus speed, cache size, maximum memory, and maximum temperature on a computer's performance. We first explain what these features are and how they relate to one another, as an introduction to the subject.

    \item[]To begin, ``CPU'' stands for ``Central Processing Unit'', which is also referred to as the central processor. This electronic circuitry performs the basic arithmetic, logic, control, and input/output operations specified by the instructions in a program. We can think of these actions as adding and removing data, as well as moving data around. A CPU used to be a single processing unit that could handle only one instruction at a time.

    \item[]However, since operating systems and programs now handle far more data and issue considerably more instructions, processors contain multiple processing units, which are referred to as cores. One processor can therefore process multiple instructions at a time, which significantly increases the speed of the CPU. For even better performance and multitasking, each core is further split into threads, in the same way that the central processor is split into cores.

    \begin{figure}[H]
        \centering
        \includegraphics[height=8cm]{Picture/intro_2.1.png}
        \caption{Cores and threads illustration.}
        \label{2.1}
    \end{figure}

    \item[]The bus figure in our dataset refers to the front side bus (FSB), which connects the CPU to the memory controller and manages the flow of data to and from the computer's main memory (RAM/ROM). Therefore, the higher the FSB speed, the better the computer's performance tends to be.

    \item[]What is \textbf{cache}? Cache is a small amount of memory that is part of the CPU and is therefore closer to it than RAM. It is used to temporarily hold instructions and data that the CPU is likely to reuse. The CPU control unit automatically checks the cache for instructions and data before requesting them from RAM. This saves fetching the instructions and data repeatedly from RAM, a relatively slow process that might otherwise keep the CPU waiting. Transfers to and from cache take less time than transfers to and from RAM. The more cache there is, the more data can be stored closer to the CPU.


    \item[]Cache is graded as Level 1 (L1), Level 2 (L2) and Level 3 (L3):
    \begin{itemize}

        \item \textbf{L1} is usually part of the CPU chip itself and is both the smallest and the fastest to access. Its size is often restricted to between 8 KB and 64 KB.

        \item \textbf{L2} and \textbf{L3} caches are bigger than \textbf{L1}. They are extra caches built between the CPU and the RAM. Sometimes L2 is built into the CPU with L1. L2 and L3 caches take slightly longer to access than \textbf{L1}. The more \textbf{L2} and \textbf{L3} memory available, the faster a computer can run.

        \begin{figure}[H]
            \centering
            \includegraphics[height=8cm, width=13cm]{Picture/cache_size.jpg}
            \caption{Cache memory level.}
            \label{2.2}
        \end{figure}
    \end{itemize}

    \item[]Not a lot of physical space is allocated for cache. There is more space for RAM, which is usually larger and less expensive.

    \item[]Max Memory is the maximum amount of RAM the CPU can work with. For example, if a CPU with a maximum memory of 32 GB is paired with 64 GB of RAM, only 32 GB will be used.

    \item[]Max Temperature is the highest temperature at which the CPU can still operate properly. If the CPU exceeds this limit, it reduces its performance in order to cool down, and it may be damaged if the temperature rises too far.
\end{itemize}

\section{Data Import}
\begin{itemize}
    \item[] To import the data from the .csv file and store it in an object, the following code segment is used:
    \begin{lstlisting}
    > setwd("C:/Users/Truong/Downloads/Documents")
    > intel <- read.csv ("IntelProcessors.csv", header = TRUE, sep = ",")
    > View(intel)
    \end{lstlisting}

    \item[] We use the \texttt{setwd} instruction to set the working directory to the folder where our .csv file is stored.
    Furthermore, the R script file should also be placed in the same folder.

    \item[] The \texttt{read.csv} instruction then imports the data from the file ``IntelProcessors.csv'' and stores it in an object named \texttt{intel}.

    \item[] The \texttt{View} instruction allows us to take a look at our data frame.

    \begin{figure}[H]
        \centering
        \includegraphics[width=12cm]{Picture/3.png}
        \caption{Data frame after import}
        \label{3.1}
    \end{figure}

    \item[] Our dataset contains information on 1098 Intel CPUs released from 2010 to 2021.
\end{itemize}

\section{Data cleaning}
\begin{itemize}

    \item[]After reading the input file, the next step is to check whether the data contain empty cells. We therefore wrote the following code to clean the variables.
    \begin{lstlisting}
    # Data Cleaning
    > install.packages("tidyverse")
    > library(tidyr)
    > cleanIntel <- drop_na(intel)
    > View(cleanIntel)
    \end{lstlisting}

    \item[] The \texttt{drop\_na} instruction checks each row in turn for empty cells. If a row contains any, the whole row is deleted from the data frame. A new object named \texttt{cleanIntel} holds our data after this process.

    \item[] As we can see, about half of the data frame has been removed due to empty entries: 521 of the original 1098 rows remain.

    \begin{figure}[H]
        \centering
        \includegraphics[width=10cm]{Picture/4.png}
        \caption{After cleaning data}
        \label{4.1}
    \end{figure}

    \item[] Now we list the values taken by each variable and its data type in \texttt{cleanIntel}. This information may be helpful in later sections. To do that, we repeatedly call two functions, \textbf{table} and \textbf{is.numeric}.

    \begin{lstlisting}
    > is.numeric(cleanIntel$name)
    [1] FALSE

    > is.numeric(cleanIntel$launch_date)
    [1] FALSE

    > table(cleanIntel$cores)
      2   4   6   8  10  12  14  16  18  24  28  32  38
     98 216 100  60  20   6   4   4   5   3   3   1   1
    > is.numeric(cleanIntel$cores)
    [1] TRUE

    > table(cleanIntel$threads)
      2   4   6   8  12  16  20  24  28  32  36  48  56  64  76
      2 157  25 162  75  53  20   6   4   4   5   3   3   1   1
    > is.numeric(cleanIntel$threads)
    [1] TRUE

    > table(cleanIntel$bus_speed)
      0   4   5   8
      3  85 127 306
    > is.numeric(cleanIntel$bus_speed)
    [1] TRUE

    > table(cleanIntel$base_frequency)
      700  800 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900
        1    1    4    9    6    6    5    9   16   10   15   9
     2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000
       15    7   19   21   25   23   26   25   34   28   25
     3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200
       25   18   22   15   24   26   20   12    4   10    4    1
     4300
        1
    > is.numeric(cleanIntel$base_frequency)
    [1] TRUE

    > table(cleanIntel$turbo_frequency)
    1900 2000 2300 2400 2600 2700 2800 2900 3000 3100 3200 3300
       3    2    2    1    2    7    3    9   10   10   19   19
    3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400
      24   26   27   18   30   38   25   26   16   17   32
    4500 4600 4700 4800 4900 5000 5100 5200 5300
      30   27   16   21   15   18   11    8    9
    > is.numeric(cleanIntel$turbo_frequency)
    [1] TRUE

    > table(cleanIntel$cache_size)
     2048  3072  4096  6144  8192  8448  9216 10240 11264 12288
        2    38    56   116    93     8    27     2     2    74
     14080 15360 16384 16896 19712 20480 21504 22528 24576
         2     5    44     6    11    15     1     3     3
    25344 33792 36864 39424 49152 58368
        5     2     1     3     1     1
    > is.numeric(cleanIntel$cache_size)
    [1] TRUE

    > table(cleanIntel$max_memory_size)
    16777216    33554432    67108864 67350036.48  67580723.2
          52         124         121           2           1
    134217728   268435456   536870912  1073741824
          188           4           8          13
    2147483648  4294967296
             3           5
    > is.numeric(cleanIntel$max_memory_size)
    [1] TRUE

    > table(cleanIntel$max_temp)
       59    61    62    64    65    66 66.35  66.8    68    70
        1     3     2     6     2     5     6    10     5     1
       71 71.35 71.45    72 72.72    73 74.04    76    77
        8     9     5     2    11     1     1     2     2
       78    80    82    84    85    86    88    92    94    95
        2     5     2     1     2     2     1     3     3     3
       98    99   100   102   105
        1     1   391     1    21
    > is.numeric(cleanIntel$max_temp)
    [1] TRUE
    \end{lstlisting}
\end{itemize}
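\begin{itemize}
    \item[] The column-by-column checks above can also be run in a single pass. The following is a minimal sketch in base R on a small, made-up data frame: \texttt{intel\_toy} and its values are hypothetical and are not taken from the dataset, and \texttt{complete.cases} is used here as a base-R stand-in for \texttt{drop\_na}.
    \begin{lstlisting}
    # Hypothetical toy data standing in for the real data frame
    intel_toy <- data.frame(
      name     = c("A", "B", "C", "D"),
      cores    = c(2, 4, NA, 8),
      max_temp = c(100, NA, 72, 105),
      stringsAsFactors = FALSE
    )

    # Count the missing cells in every column before cleaning
    na_per_column <- colSums(is.na(intel_toy))

    # Keep only the rows that contain no NA at all
    clean_toy <- intel_toy[complete.cases(intel_toy), ]

    # Number of rows removed by the cleaning step
    rows_removed <- nrow(intel_toy) - nrow(clean_toy)

    # Check the type of every column at once
    numeric_cols <- sapply(clean_toy, is.numeric)

    # Build one frequency table per numeric column
    freq_tables <- lapply(clean_toy[numeric_cols], table)
    \end{lstlisting}
    \item[] Counting the missing cells per column before dropping rows also shows which variables are responsible for most of the removed rows.
\end{itemize}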

\section{Data Visualization}
\subsection{Transformation}
\begin{itemize}
    \item[] To analyze the dataset, we first remove several unwanted columns: the index of each object, the name and the launch date.
    \begin{lstlisting}
    > newIntel <- cleanIntel[,c("cores", "threads", "bus_speed",
            "base_frequency", "turbo_frequency", "cache_size",
            "max_memory_size", "max_temp")]
    > head(newIntel)
    \end{lstlisting}

    \item[] By doing this, we select only the attributes needed to compute the statistics. In this case, ``cores'', ``threads'', ``bus\_speed'', ``base\_frequency'', ``turbo\_frequency'', ``cache\_size'', ``max\_memory\_size'' and ``max\_temp'' are selected into a new data frame called \texttt{newIntel}.
    \item[] The result is:

    \begin{figure}[H]
        \centering
        \includegraphics[width=10cm]{Picture/newIntel.png}
        \caption{Comparison amongst 3 data frames.}
    \end{figure}

    \item[] As we can see, \texttt{newIntel} has three fewer variables than \texttt{cleanIntel}: ``id'', ``name'' and ``launch\_date'' have been dropped.

\end{itemize}

\subsection{Descriptive Statistics}
\begin{itemize}
    \item [] In statistics, there are some crucial values that we need to compute, such as the minimum, maximum, median, mean, variance and sum.

    \begin{lstlisting}
    > install.packages("pastecs")
    > library(pastecs)
    > stat.desc(newIntel[, c(1,2,3,4,5,6,7,8)]) %>% round(4)
    \end{lstlisting}

    \item[] And the result is:

    \begin{figure}[H]
        \centering
        \includegraphics[width= 10cm]{Picture/De_stat.png}
        \caption{Some significant descriptive Statistics.}
    \end{figure}
\end{itemize}

\subsection{Graphs}
\subsubsection{Histogram for The Number of Cores and Threads}
\begin{itemize}
    \item[] We will use \textbf{histograms} to describe the numbers of cores and threads in the dataset.

    \begin{lstlisting}
    > #Histograms of Cores and Threads
    > library(ggpubr)  # provides gghistogram and ggstripchart
    > gghistogram(cleanIntel, x = "cores", fill = "blue",
        add = "mean", rug = TRUE, add_density = TRUE)
    > gghistogram(cleanIntel, x = "threads", fill = "red",
        add = "mean", rug = TRUE, add_density = TRUE)
    \end{lstlisting}

    \begin{figure}[H]
        \centering
        \includegraphics[width= 10cm]{Picture/cores-hist.png}
        \caption{Histogram of Cores.}
    \end{figure}

    \begin{figure}[H]
        \centering
        \includegraphics[width= 10cm]{Picture/threads_hist.png}
        \caption{Histogram of Threads.}
    \end{figure}
\end{itemize}

\subsubsection{Base Frequency, Turbo Frequency and Bus speed related to The Number of Cores and Threads}
\begin{itemize}
    \item [] The relationships of Cores and Threads with Base Frequency, Turbo Frequency and Bus Speed also play a crucial role in analyzing the dataset. For each pair, we use a strip chart to illustrate the connection between the two variables.
    \begin{lstlisting}
    #strip chart for cores
    > ggstripchart(newIntel,x ='turbo_frequency',y='cores',color='cores')
    > ggstripchart(newIntel,x ='base_frequency',y='cores',color ='cores')
    > ggstripchart(newIntel,x ='bus_speed',y='cores',color='cores')
    \end{lstlisting}
    \begin{figure}[H]
        \centering
        \includegraphics[height=3cm, width=14cm]{Picture/5.3.3/core-turbo.png}
        \includegraphics[height=3cm, width=14cm]{Picture/5.3.3/core-basef.png}
        \includegraphics[height=6cm, width=10cm]{Picture/5.3.3/core-bus.png}
        \caption{Strip charts for cores in relation to the other variables.}
    \end{figure}
    \begin{lstlisting}
    #strip chart for threads
    > ggstripchart(newIntel,x ='turbo_frequency',y='threads',color='threads')
    > ggstripchart(newIntel,x ='base_frequency',y='threads',color='threads')
    > ggstripchart(newIntel,x ='bus_speed',y='threads',color='threads')
    \end{lstlisting}

    \begin{figure}[H]
        \centering
        \includegraphics[height=3cm, width=14cm]{Picture/5.3.3/thread-turbo.png}
    \end{figure}

    \begin{figure}[H]
        \centering
        \includegraphics[height=3cm, width=14cm]{Picture/5.3.3/thread-base.png}
        \includegraphics[height=6cm, width=10cm]{Picture/5.3.3/thread-bus.png}
        \caption{Strip charts for threads in relation to the other variables.}
    \end{figure}
\end{itemize}


\subsubsection{Box Plot for The Cache Size, Maximum Memory Size and Bus Speed}
\begin{itemize}

    \item[] We will use \textbf{box plots} to show the distribution of cache\_size and max\_memory\_size, both measured in KB within the range $(0, 4 \times 10^{9})$, and of bus\_speed, measured in GT/s. Each box plot also marks the median of the variable. The two packages used for this work are \textbf{tidyr} and \textbf{ggplot2}.


    \begin{lstlisting}
    #Box plot for cache_size
    > library(ggplot2)
    > gathered <- newIntel %>%
        pivot_longer(c(cache_size), values_to="KB")
    > ggplot(gathered, aes(y = KB)) +
        geom_boxplot(fill = 'red') +
        labs(x="Cache size", y ="KB")
    \end{lstlisting}

     \begin{figure}[H]
        \centering
        \includegraphics[width=6cm]{Picture/cache_boxplot.png}
        \caption{Box plot for cache\_size.}
        \label{5.3.3.2}
    \end{figure}

    \begin{lstlisting}
    #Box plot for max_memory_size
    > gathered <- newIntel %>%
        pivot_longer(c(max_memory_size), values_to="KB")
    > ggplot(gathered, aes(y = KB)) +
        geom_boxplot(fill = 'red') +
        labs(x="Max memory size", y ="KB")
    \end{lstlisting}


    \begin{figure}[H]
        \centering
        \includegraphics[width = 6cm]{Picture/memory_boxplot.png}
        \caption{Box plot for  max\_memory\_size.}
        \label{5.3.3.3}
    \end{figure}
    \newpage

     \begin{lstlisting}
    #Box plot for bus_speed
    > gathered <- newIntel %>%
        pivot_longer(c(bus_speed), values_to="GT_s")
    > ggplot(gathered, aes(y = GT_s)) +
        geom_boxplot(fill = 'red') +
        labs(x="Bus speed", y ="GT/s")
    \end{lstlisting}


    \begin{figure}[H]
        \centering
        \includegraphics[width=6cm]{Picture/bus_boxplot.png}
        \caption{Box plot for bus\_speed.}
        \label{5.3.3.4}
    \end{figure}
\end{itemize}

\section{Theoretical Basis}

\subsection{Multi-factor ANOVA Test}

\subsubsection{Basic Concept of Two-way ANOVA}
\begin{itemize}
    \item[] A two-way ANOVA is used to estimate how the mean of a quantitative variable changes according to the levels of two categorical variables. It is the appropriate tool when we want to know how two independent variables, in combination, affect a dependent variable.

    \item[] \textbf{How does the ANOVA test work?}
    \begin{itemize}
        \item[] ANOVA uses the F-test to assess statistical significance. The F-test is a group-wise comparison test, which means it compares the variance in each group mean to the overall variance in the dependent variable.

        \item[] If the variance within groups is smaller than the variance between groups, the F-test will find a higher F-value, and therefore a higher likelihood that the difference observed is real and not due to chance.

        \item[] A two-way ANOVA with interaction tests three null hypotheses at the same time:
        \begin{itemize}
            \item[1.] There is no difference in group means at any level of the first independent variable.
            \item[2.] There is no difference in group means at any level of the second independent variable.
            \item[3.] The effect of one independent variable does not depend on the effect of the other independent variable (no interaction effect).
        \end{itemize}
    \end{itemize}

    \item[] \textbf{Assumptions of the two-way ANOVA}
    \begin{itemize}
        \item[1.] Normally-distributed dependent variable: The values of the dependent variable should follow a bell curve.

        \item [2.] Homogeneity of variance (or Homoscedasticity): The variation around the mean for each group being compared should be similar among all groups.

        \item[3.] Independence of observation: Independent variables should not be dependent on one another (i.e. one should not cause the other). This is impossible to test with categorical variables – it can only be ensured by good experimental design. In addition, the dependent variable should represent unique observations – that is, your observations should not be grouped within locations or individuals.
    \end{itemize}
\end{itemize}
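\begin{itemize}
    \item[] The three null hypotheses listed above correspond to the three effect rows of a two-way ANOVA table. The following is a minimal sketch, assuming R's built-in \texttt{ToothGrowth} data rather than the CPU dataset (it is used only because it ships with R and has two categorical factors); the object names \texttt{tg} and \texttt{fit\_int} are illustrative.
    \begin{lstlisting}
    # Built-in example data: tooth length by supplement type and dose
    tg <- ToothGrowth
    tg$dose <- factor(tg$dose)   # treat dose as a categorical factor

    # "supp * dose" expands to both main effects plus their interaction
    fit_int <- aov(len ~ supp * dose, data = tg)

    # One row per hypothesis (supp, dose, supp:dose) plus the residuals
    anova_table <- summary(fit_int)[[1]]
    print(anova_table)
    \end{lstlisting}
    \item[] Each effect row reports an F value and its p-value. A small p-value in the interaction row suggests that the effect of one factor depends on the level of the other.
\end{itemize}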

\subsubsection{Find the best-fit model}
\begin{itemize}

    \item[] When doing research, we may build several ANOVA models to explain the data. Usually, we want to use the best-fit model, that is, the one that best explains the variation in the dependent variable.

    \item[] The Akaike information criterion (AIC) is a good test for model fit. AIC calculates the information value of each model by balancing the variation explained against the number of parameters used.

    \item[] In AIC model selection, we compare the information value of each model and choose the one with the lowest AIC value (a lower number means more information explained).
\end{itemize}
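\begin{itemize}
    \item[] For a model with $k$ estimated parameters and maximized likelihood $\hat{L}$, the criterion is $\mathrm{AIC} = 2k - 2\ln\hat{L}$. As a minimal sketch of the comparison described above, the code below fits an additive and an interaction model and ranks them by AIC. It assumes R's built-in \texttt{ToothGrowth} data instead of the CPU dataset, and the names \texttt{fit\_add}, \texttt{fit\_int} and \texttt{aic\_tab} are illustrative.
    \begin{lstlisting}
    tg <- ToothGrowth
    tg$dose <- factor(tg$dose)

    # Two candidate models: without and with the interaction term
    fit_add <- aov(len ~ supp + dose, data = tg)
    fit_int <- aov(len ~ supp * dose, data = tg)

    # AIC() returns one row per model with its parameter count and AIC
    aic_tab <- AIC(fit_add, fit_int)
    print(aic_tab)

    # Name of the model with the lowest AIC
    best_model <- rownames(aic_tab)[which.min(aic_tab$AIC)]
    print(best_model)
    \end{lstlisting}
    \item[] Because the penalty term $2k$ grows with the number of parameters, a more complex model is preferred only when it improves the likelihood enough to offset that penalty.
\end{itemize}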

\subsubsection{Levene Test for Homoscedasticity of Variance}
\begin{itemize}
    \item[] In statistics, Levene's test is an inferential statistic used to evaluate the equality of variances for a variable determined for two or more groups. Some standard statistical procedures assume that the variances of the populations from which the various samples are drawn are equal. Levene's test assesses this assumption.

    \item[] It tests the null hypothesis that the population variances are equal, which is called homogeneity of variance or homoscedasticity. It compares the variances of $k$ samples, where $k$ can be more than two.

    \item[] It’s an alternative to Bartlett’s test that is less sensitive to departures from normality.

    \item[] Given a variable $Y$ with a sample of size $N$ divided into $k$ subgroups, where $N_i$ is the sample size of the $i^{th}$ subgroup, the Levene test statistic is defined as:
    \begin{itemize}
        \large
        \centering
        \item[] $W = \dfrac{N-k}{k-1} \cdot \dfrac{\sum_{i=1}^{k} N_i(\overline{Z_i}-\overline{Z})^2}{\sum_{i=1}^{k} \sum_{j=1}^{N_i}(Z_{ij}-\overline{Z_i})^2}$
    \end{itemize}

    \item[] where $Z_{ij}$ can have one of the following three definitions:
    \begin{itemize}
        \item[1.] $Z_{ij} = \left| Y_{ij} - \overline{Y_i} \right|$, where $\overline{Y_i}$ is the \textbf{mean} of the $i^{th}$ subgroup.

        \item[2.] $Z_{ij} = \left| Y_{ij} - \hat{Y_i} \right|$, where $\hat{Y_i}$ is the \textbf{median} of the $i^{th}$ subgroup.

        \item[3.] $Z_{ij} = \left| Y_{ij} - \overline{Y_i^{'}} \right|$, where $\overline{Y_i^{'}}$ is the $10 \%$ \textbf{trimmed mean} of the $i^{th}$ subgroup.
    \end{itemize}

    \item[] $\overline{Z_i}$ are the group means of the $Z_{ij}$ and $\overline{Z}$ is the overall mean of the $Z_{ij}$.

    \item[] The three choices for defining $Z_{ij}$ determine the robustness and power of Levene's test. By robustness, we mean the ability of the test to not falsely detect unequal variances when the underlying data are not normally distributed and the variances are in fact equal. By power, we mean the ability of the test to detect unequal variances when the variances are in fact unequal.
\end{itemize}
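\begin{itemize}
    \item[] To make the formula for $W$ concrete, the following sketch computes it by hand in base R using the median definition of $Z_{ij}$. The function name \texttt{levene\_W} is hypothetical and the built-in \texttt{ToothGrowth} data are used only for illustration. The p-value is taken from the $F$ distribution with $k-1$ and $N-k$ degrees of freedom.
    \begin{lstlisting}
    # Levene statistic with Z_ij = |Y_ij - median of group i|
    levene_W <- function(y, g) {
      g <- factor(g)
      z <- abs(y - ave(y, g, FUN = median))  # Z_ij
      N <- length(y)                         # total sample size
      k <- nlevels(g)                        # number of subgroups
      n_i   <- tapply(z, g, length)          # N_i
      z_i   <- tapply(z, g, mean)            # group means of Z
      z_bar <- mean(z)                       # overall mean of Z
      between <- sum(n_i * (z_i - z_bar)^2)  # numerator sum
      within  <- sum((z - ave(z, g))^2)      # denominator sum
      W <- (N - k) / (k - 1) * between / within
      p <- pf(W, k - 1, N - k, lower.tail = FALSE)
      c(W = W, p.value = p)
    }

    tg <- ToothGrowth
    res <- levene_W(tg$len, tg$supp)
    print(res)
    \end{lstlisting}
    \item[] The statistic $W$ is the same as the F value of a one-way ANOVA carried out on the $Z_{ij}$, which gives a simple way to cross-check the computation.
\end{itemize}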

\subsubsection{Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test}
\begin{itemize}
    \item[] ANOVA will tell us if there are differences among group means, but not what the differences are. To find out which groups are statistically different from one another, we can perform a Tukey’s Honestly Significant Difference (Tukey’s HSD) post-hoc test for pairwise comparisons.

    \item[] Tukey’s test compares the means of all treatments to the mean of every other treatment and is considered the best available method in cases when confidence intervals are desired or if sample sizes are unequal.

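    \item[] The test statistic used in Tukey's test is denoted $q$ and is essentially a modified $t$-statistic that corrects for multiple comparisons. The critical value of $q$ is looked up in a similar way to that of the $t$-statistic, for significance level $\alpha$, $k$ groups and $N-k$ degrees of freedom: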
    \begin{itemize}
        \centering
        \large
        \item[] $q_{\alpha,k,N-k}$
    \end{itemize}

    \item[] The studentized range statistic $q_s$ is defined as:
    \begin{itemize}
        \centering
        \large
        \item[] $q_s=\dfrac{Y_{max}-Y_{min}}{se}$
    \end{itemize}

    \item[] where $Y_{max}$ and $Y_{min}$ are the larger and smaller of the two group means being compared, and $se$ is the standard error of a group mean, computed from the pooled within-group variance of the ANOVA.
\end{itemize}
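
A minimal sketch of Tukey's HSD in base R is given below, again on the built-in \texttt{PlantGrowth} data (an assumption made for illustration, with equal group sizes). It shows the built-in \texttt{TukeyHSD} function, the critical value $q_{\alpha,k,N-k}$ obtained from \texttt{qtukey}, and a hand computation of $q_s$ for the two most distant group means.

\begin{lstlisting}[language=R]
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey)

k <- nlevels(PlantGrowth$group)   # number of groups
N <- nrow(PlantGrowth)            # total number of observations

# Critical value of the studentized range at alpha = 0.05.
q_crit <- qtukey(0.95, nmeans = k, df = N - k)

# q_s for the largest vs. smallest group mean (balanced design).
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
n_grp <- N / k
mse   <- deviance(fit) / df.residual(fit)  # within-group mean square
se    <- sqrt(mse / n_grp)
q_s   <- (max(means) - min(means)) / se

cat("q_s =", q_s, " critical value =", q_crit, "\n")
\end{lstlisting}

The pair differs significantly at the $5\%$ level when $q_s$ exceeds the critical value.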

\subsection{Kruskal-Wallis Test when countering Assumptions' Failures in ANOVA}
\subsubsection{Kruskal-Wallis Test}
\begin{itemize}
    \item[] \textbf{Definition}
    \begin{itemize}
        \item[] Kruskal-Wallis test (also known as Kruskal-Wallis H test or Kruskal–Wallis ANOVA) is a non-parametric (distribution free) alternative to the one-way ANOVA.

        \item[] Kruskal-Wallis test is useful when the assumptions of ANOVA are not met or there is a significant deviation from the ANOVA assumptions. If the data meets the ANOVA assumptions, it is better to use ANOVA as it is a little more powerful than non-parametric tests.

        \item[] The Kruskal-Wallis test is used for comparing two or more groups. It is an extension of the Mann-Whitney U test, which compares exactly two groups. It compares the mean ranks of the groups, which can be read as a comparison of medians when the group distributions have the same shape.

        \item[] Kruskal-Wallis test does not assume any specific distribution (such as normal distribution of samples) for calculating test statistics and p values.

        \item[] The sample mean ranks or medians are compared in the Kruskal-Wallis test, which distinguishes it from the ANOVA, which compares sample means. Medians are less sensitive to outliers than means.
    \end{itemize}

    \item[] \textbf{Kruskal-Wallis' Assumptions}
    \begin{itemize}
        \item[] The independent variable should have two or more independent groups.

        \item[] The observations from the independent groups should be randomly selected from the target populations.

        \item[] Observations are sampled independently from each other (no relation in observations between the groups and within the groups) i.e., each subject should have only one response.

        \item[] The dependent variable should be continuous or discrete.
    \end{itemize}

    \item[] \textbf{Kruskal-Wallis test Hypotheses}
    \begin{itemize}
        \item[] In terms of mean ranks (when the group distributions do not have the same shape),
        \begin{itemize}
            \item[] Null hypothesis: All group mean ranks are equal. \textbf{vs} Alternative hypothesis: At least one group mean rank differs from the others.
        \end{itemize}
        \item[] In terms of medians (when the group distributions have the same shape),
        \begin{itemize}
            \item[] Null hypothesis: All population medians are equal. \textbf{vs} Alternative hypothesis: At least one population median differs from the others.
        \end{itemize}
    \end{itemize}

    \item[] \textbf{Kruskal-Wallis test statistic}
    \begin{itemize}
        \centering
        \large
        \item[] $H = \left(\dfrac{12}{N(N+1)}\sum_{j=1}^{k} \dfrac{R_j^2}{n_j}-3(N+1)\right)$
    \end{itemize}

    \item[] where,
    \begin{itemize}
        \item[] $N$ is the total number of observations across all groups (total sample size)
        \item[] $k$ is the number of groups
        \item[] $n_j$ is the sample size of the $j^{th}$ group
        \item[] $R_j$ is the sum of ranks of the $j^{th}$ group
    \end{itemize}

    \item[] $H$ is approximately chi-squared distributed with $df=k-1$. The decision is made by comparing $H$ with the critical value of this distribution at the chosen significance level: if $H\geqslant$ critical value (equivalently, if the $p$-value is at most $\alpha$), we reject the null hypothesis; otherwise we fail to reject it.
\end{itemize}
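
The formula for $H$ can be checked by hand in base R. The sketch below uses the built-in \texttt{PlantGrowth} data as an assumed example. Note that \texttt{kruskal.test} additionally applies a correction for ties, so its statistic may differ slightly from the uncorrected $H$ computed here.

\begin{lstlisting}[language=R]
y <- PlantGrowth$weight
g <- PlantGrowth$group

N   <- length(y)             # total sample size
k   <- nlevels(g)            # number of groups
r   <- rank(y)               # ranks over the pooled sample
R_j <- tapply(r, g, sum)     # rank sum per group
n_j <- tapply(r, g, length)  # sample size per group

# Kruskal-Wallis statistic (without tie correction).
H <- 12 / (N * (N + 1)) * sum(R_j^2 / n_j) - 3 * (N + 1)

# Chi-squared approximation with k - 1 degrees of freedom.
p_value  <- pchisq(H, df = k - 1, lower.tail = FALSE)
critical <- qchisq(0.95, df = k - 1)
cat("H =", H, " critical =", critical, " p =", p_value, "\n")

# Built-in test for comparison (includes the tie correction).
print(kruskal.test(weight ~ group, data = PlantGrowth))
\end{lstlisting}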

\subsubsection{The Epsilon-Squared Scale}
\begin{itemize}
    \item[] For the Kruskal-Wallis test, epsilon-squared ($\epsilon^2$) is a common choice of effect size measure.

    \item[] An $\epsilon^2$ of 0 means no differences between groups (no influence), while a value of 1 indicates full dependency. Intermediate values are read on the following scale:
    \begin{itemize}
        \item[] $0.00 \leq \epsilon^2 < 0.01$ - Negligible
        \item[] $0.01 \leq \epsilon^2 < 0.04$ - Weak
        \item[] $0.04 \leq \epsilon^2 < 0.16$ - Moderate
        \item[] $0.16 \leq \epsilon^2 < 0.36$ - Relatively strong
        \item[] $0.36 \leq \epsilon^2 < 0.64$ - Strong
        \item[] $0.64 \leq \epsilon^2 \leq 1.00$ - Very strong
    \end{itemize}
\end{itemize}
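
The text above does not state how epsilon-squared is computed. One commonly used definition, assumed here, is $\epsilon^2 = H / \left((N^2-1)/(N+1)\right) = H/(N-1)$, where $H$ is the Kruskal-Wallis statistic and $N$ the total sample size. The sketch below applies it to the built-in \texttt{PlantGrowth} data (an illustrative choice) and maps the result onto the scale above.

\begin{lstlisting}[language=R]
kw <- kruskal.test(weight ~ group, data = PlantGrowth)
H  <- unname(kw$statistic)
N  <- nrow(PlantGrowth)

# Assumed definition: epsilon-squared = H / (N - 1).
eps2 <- H / (N - 1)

# Map the value onto the interpretation scale (left-closed bins).
label <- cut(eps2,
             breaks = c(0, 0.01, 0.04, 0.16, 0.36, 0.64, 1),
             labels = c("Negligible", "Weak", "Moderate",
                        "Relatively strong", "Strong", "Very strong"),
             right = FALSE, include.lowest = TRUE)

cat("epsilon-squared =", eps2, "->", as.character(label), "\n")
\end{lstlisting}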

\subsubsection{Dunn's Test for Multiple Comparisons}
\begin{itemize}
    \item[] When the results of a Kruskal-Wallis test are statistically significant, it is appropriate to conduct Dunn’s Test to determine exactly which groups are different.

    \item[] Dunn’s Test performs pairwise comparisons between each pair of independent groups and tells which groups are statistically significantly different at some significance level $\alpha$.

    \item[] Dunn's $z$-test statistic approximates the exact rank-sum test statistics by using the mean rankings of the outcome in each group from the preceding Kruskal-Wallis test. To compare groups A and B, we calculate:
    \begin{itemize}
        \centering
        \large
        \item[]  $z_i = \dfrac{y_i}{\sigma_i}$
    \end{itemize}

    \item[] where $i$ is one of the 1 to $m$ multiple comparisons, $y_i = \overline{W_A}-\overline{W_B}$ is the difference between the mean ranks of groups A and B, and $\sigma_i$ is the standard deviation of $y_i$, given by:
    \begin{itemize}
        \large
        \centering
        \item[] $\sigma_i = \sqrt{\left[\dfrac{N(N+1)}{12}-\dfrac{\sum_{s=1}^{r}\left(\tau^3_s-\tau_s\right)}{12(N-1)}\right]\left(\dfrac{1}{n_A}+\dfrac{1}{n_B}\right)}$
    \end{itemize}
    \item[] where $N$ is the total number of observations across all groups, $n_A$ and $n_B$ are the sizes of groups A and B, $r$ is the number of distinct tied values, and $\tau_s$ is the number of observations tied at the $s^{th}$ tied value. When there are no ties, the term containing the summation equals zero, and the calculation simplifies considerably.


    \item[] \textbf{Multiple-Comparison Adjustments}
    \begin{itemize}
        \item[] There are several methods for the adjustments such as the Bonferroni adjustment, Holm's stepwise adjustment, Holm-Sidak's stepwise adjustment and Benjamini-Hochberg stepwise adjustment.

        \item[] We will use the Benjamini-Hochberg method, which controls the false discovery rate. Because our data set is quite large and many comparisons are made, some small $p$-values (less than $5\%$) will occur by chance, which could lead us to incorrectly reject true null hypotheses.
    \end{itemize}

\end{itemize}
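
The effect of the adjustment methods listed above can be illustrated with the base R function \texttt{p.adjust}. The raw $p$-values below are hypothetical numbers chosen for the example, not results from our data.

\begin{lstlisting}[language=R]
# Hypothetical raw p-values from three pairwise comparisons.
p_raw <- c(AB = 0.004, AC = 0.030, BC = 0.200)

p_bh   <- p.adjust(p_raw, method = "BH")          # Benjamini-Hochberg
p_holm <- p.adjust(p_raw, method = "holm")        # Holm stepwise
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # Bonferroni

print(rbind(raw = p_raw, BH = p_bh, holm = p_holm, bonferroni = p_bonf))

# Comparisons still significant at alpha = 0.05 after BH adjustment.
print(names(p_bh)[p_bh <= 0.05])
\end{lstlisting}

The Benjamini-Hochberg adjusted values are never larger than the Bonferroni ones, which is why the method retains more power when many comparisons are made.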

\subsection{Multiple Linear Regression Model}
\subsubsection{Basic Concept}
\begin{itemize}

    \item [] Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. Multiple linear regression can be used when we want to know:
    \begin{itemize}
        \item[1.] How strong the relationship is between two or more independent variables and one dependent variable.

        \item[2.] The value of the dependent variable at a certain value of the independent variables.
    \end{itemize}

    \item[] \textbf{Assumptions of multiple linear regression}
    \begin{itemize}
        \item[] \textbf{Linearity}: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor. The Residuals vs Fitted and Normal Q-Q graphs are used to check this assumption.

        \item[] \textbf{Normality}: The residuals of the model follow a normal distribution. This assumption is checked with the Shapiro-Wilk Test, the Normal Q-Q graph and the Residual Histogram.

        \item[] \textbf{Independence of observations (Multicollinearity)}: In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check for this before developing the regression model. If two independent variables are too highly correlated (roughly $r^{2} > 0.6$), then only one of them should be used in the regression model. We will verify this with the Correlation Matrix and the Variance Inflation Factor (VIF).

        \item[] \textbf{Homogeneity of variance (Homoscedasticity)}: the size of the error in our prediction doesn’t change significantly across the values of the independent variables. Put simply, the variance of the errors is the same for all points. The Breusch-Pagan Test and the Scale-Location and Residuals vs Fitted graphs are used to check this assumption.
    \end{itemize}

    \item[] \textbf{Multiple linear regression formula}
    \begin{itemize}
        \centering
        \large
        \item[] $y=\alpha + \beta_1X_1 + \dots + \beta_nX_n + \epsilon$, where:
        \begin{itemize}
            \item[] \textbf{y}: the predicted value of the dependent variable.

            \item[] \textbf{$\alpha$}: the y-intercept.

            \item[] \textbf{$\beta_i, i=1,2,\dots,n$}: the regression coefficient of the $i^{th}$ variable $X_i$.

            \item[] \textbf{$\epsilon$}: model error.
        \end{itemize}
    \end{itemize}

\end{itemize}
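
As a minimal sketch of the two uses listed above, the following fits a model with two predictors on the built-in \texttt{mtcars} data. The data set, the chosen variables and the values in \texttt{new\_car} are assumptions made purely for illustration.

\begin{lstlisting}[language=R]
# Fit y = alpha + beta_1 * wt + beta_2 * hp + epsilon.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Estimated intercept (alpha) and regression coefficients (beta_i).
print(coef(fit))

# Predicted value of the dependent variable for a hypothetical car.
new_car <- data.frame(wt = 3.0, hp = 120)
pred <- predict(fit, newdata = new_car)
print(pred)
\end{lstlisting}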
\subsubsection{Interpreting Linear Regression Model Output in R}
\begin{itemize}
    \item[] Consider the following example:
    \begin{lstlisting}[language=R]
    Call:
    lm(formula = dist ~ speed.c, data = cars)

    Residuals:
        Min      1Q  Median      3Q     Max
    -29.069  -9.525  -2.272   9.215  43.201

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  42.9800     2.1750  19.761  < 2e-16 ***
    speed.c       3.9324     0.4155   9.464 1.49e-12 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 15.38 on 48 degrees of freedom
    Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
    F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
    \end{lstlisting}

    \item[] Now, we will briefly explain each component of the model output:

    \item[] \textbf{Formula Call}: The first item shown in the output is the formula R used to fit the data.

    \item[] \textbf{Residuals}: The Residuals section of the model output summarizes the residuals with 5 points (minimum, first quartile, median, third quartile and maximum). When assessing how well the model fits the data, we should look for a distribution that is symmetrical around zero. In our example, the distribution of the residuals does not appear to be strongly symmetrical: the median is $-2.272$ and the maximum ($43.201$) is larger in magnitude than the minimum ($-29.069$). That means that the model predicts certain points that fall far away from the actual observed points.

    \item[] \textbf{Coefficients}:
    \begin{itemize}
        \item[] \textbf{Estimate}: The coefficient Estimate contains two rows. The first one is the intercept and the second one is the slope.

        \item[] \textbf{Standard Error}: The coefficient Standard Error estimates the standard deviation of the coefficient estimate, i.e. how much the estimate would vary if the model were fitted to repeated samples. Smaller values relative to the estimate indicate a more precisely estimated coefficient.

        \item[] \textbf{t-value}: The coefficient t-value measures how many standard errors our coefficient estimate is away from 0. We want it to be far away from zero, as this would indicate we could reject the null hypothesis, that is, we could declare that a relationship between speed and distance exists.

        \item[] \textbf{Pr($>|t|$)}: The Pr($>|t|$) column in the model output is the probability of observing a value at least as large as $|t|$ when the true coefficient is zero. A small p-value for the intercept and the slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between speed and distance. Typically, a p-value of $5\%$ or less is a good cut-off point. The ‘Signif. codes’ line explains the symbols attached to each estimate: three stars (or asterisks) represent a highly significant p-value.
    \end{itemize}

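    \item[] \textbf{Residual Standard Error}: Residual Standard Error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term $\epsilon$, so the response cannot be predicted perfectly from the predictors. The Residual Standard Error estimates the standard deviation of this error term, i.e. the typical amount by which the response deviates from the fitted regression line. In our example it is 15.38 on 48 degrees of freedom.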

    \item[] \textbf{Multiple R-squared, Adjusted R-squared}: The R-squared $(R^{2})$ statistic provides a measure of how well the model fits the actual data. It is the proportion of the variance in our response / target variable (dist) that is explained by our predictor variable (speed). It always lies between 0 and 1: a number near 0 represents a regression that does not explain the variance in the response variable well, and a number close to 1 represents one that does. In multiple regression settings, $R^2$ never decreases as more variables are included in the model, even if they are unrelated to the response. That’s why the adjusted $R^2$ is the preferred measure, as it adjusts for the number of variables considered.

    \item[] \textbf{F-Statistic}: The F-statistic is a good indicator of whether there is a relationship between our predictor and the response variables. The further the F-statistic is above 1, the stronger the evidence. However, how much larger the F-statistic needs to be depends on both the number of data points and the number of predictors. Generally, when the number of data points is large, an F-statistic that is only a little bit larger than 1 is already sufficient to reject the null hypothesis ($H_0$ : There is no relationship between speed and distance). Conversely, if the number of data points is small, a large F-statistic is required to ascertain that there may be a relationship between predictor and response variables.
\end{itemize}
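
The components described above can also be read programmatically from the summary object. The sketch below assumes that \texttt{speed.c} in the example is the speed column of the built-in \texttt{cars} data centred on its mean. This is an assumption, although it is consistent with the reported intercept being the mean stopping distance.

\begin{lstlisting}[language=R]
# Work on a copy so the built-in data set is left untouched.
cars_c <- cars
cars_c$speed.c <- cars_c$speed - mean(cars_c$speed)  # assumed centring

fit <- lm(dist ~ speed.c, data = cars_c)
s   <- summary(fit)

print(s$call)                    # Formula Call
print(quantile(residuals(fit)))  # five-point summary of the residuals
print(coef(s))                   # Estimate, Std. Error, t value, Pr(>|t|)
print(s$sigma)                   # Residual standard error
print(c(s$r.squared, s$adj.r.squared))
print(s$fstatistic)              # F value with its two degrees of freedom
\end{lstlisting}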

\subsubsection{Breusch-Pagan Test for Heteroscedasticity in Regression Models}
\begin{itemize}
    \item[] A Breusch-Pagan Test is used to determine if heteroscedasticity is present in a regression analysis. Derived from the Lagrange multiplier test principle, it tests whether the variance of the errors from a regression depends on the values of the independent variables. If it does, heteroscedasticity is present.

    \item[] The Breusch-Pagan test statistic is asymptotically distributed as $\chi_{p-1}^2$ under the null hypothesis of homoscedasticity, where $p$ is the number of estimated coefficients (including the intercept). The test statistic is calculated as
    \begin{itemize}
        \centering
        \large
        \item[] $\chi^2 = nR^2$
    \end{itemize}

    \item[] where $n$ is the total number of observations and $R^2$ is the R-squared of an auxiliary regression model that uses the squared residuals as the response values.

    \item[] If the $p$-value corresponding to this Chi-Square test statistic is less than the significance level (e.g. $\alpha=0.05$), then we reject the null hypothesis and conclude that heteroscedasticity is present. Otherwise, we fail to reject the null hypothesis. In this case, homoscedasticity is assumed to hold.
\end{itemize}
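
The statistic $\chi^2 = nR^2$ can be computed by hand in base R, as sketched below on the built-in \texttt{mtcars} data (the data set and the chosen predictors are assumptions made for illustration). In practice a packaged implementation such as \texttt{bptest} from the \texttt{lmtest} package would normally be used instead.

\begin{lstlisting}[language=R]
# Original regression model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors.
e2  <- residuals(fit)^2
aux <- lm(e2 ~ wt + hp, data = mtcars)

n  <- nobs(fit)
bp <- n * summary(aux)$r.squared   # chi-squared = n * R^2
df <- length(coef(aux)) - 1        # p - 1 degrees of freedom

p_value <- pchisq(bp, df = df, lower.tail = FALSE)
cat("BP =", bp, " df =", df, " p-value =", p_value, "\n")
\end{lstlisting}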

\subsubsection{The Usage of Correlation Matrix}
\begin{itemize}
    \item[] In statistics, we’re often interested in understanding the relationship between two variables.

    \item[] One way to quantify this relationship is to use the Pearson correlation coefficient, which is a measure of the linear association between two variables. It has a value between -1 and 1 where:
    \begin{itemize}
        \item[] -1 indicates a perfectly negative linear correlation between two variables.

        \item[] 0 indicates no linear correlation between two variables.

        \item[] 1 indicates a perfectly positive linear correlation between two variables.
    \end{itemize}

    \item[] The further away the correlation coefficient is from zero, the stronger the relationship between the two variables.

    \item[] In some cases we want to understand the correlation between more than just one pair of variables. In these cases, we can create a correlation matrix, which is a square table that shows the correlation coefficients between several variables.

    \item[] \textbf{So when to use a correlation matrix}
    \begin{itemize}
        \item[1.] A correlation matrix conveniently summarizes a dataset.
        \item[2.] A correlation matrix serves as a diagnostic for regression.
        \begin{itemize}
            \item[] One key assumption of multiple linear regression is that no independent variable in the model is highly correlated with another variable in the model.

            \item[] When two independent variables are highly correlated, this results in a problem known as multicollinearity and it can make it hard to interpret the results of the regression.

            \item[] One of the easiest ways to detect a potential multicollinearity problem is to look at a correlation matrix and visually check whether any of the variables are highly correlated with each other.
        \end{itemize}
        \item[3.] A correlation matrix can be used as an input in other analyses.
    \end{itemize}
\end{itemize}
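
A correlation matrix can be produced with the base R function \texttt{cor}. The sketch below uses a few columns of the built-in \texttt{mtcars} data (an illustrative assumption) and flags pairs of predictors whose squared correlation exceeds $0.6$, the threshold mentioned earlier for multicollinearity.

\begin{lstlisting}[language=R]
vars <- mtcars[, c("wt", "hp", "disp", "drat")]

# Pearson correlation matrix of the candidate predictors.
cm <- cor(vars, method = "pearson")
print(round(cm, 2))

# Pairs above the diagonal with r^2 > 0.6.
idx <- which(cm^2 > 0.6 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                         var2 = colnames(cm)[idx[, "col"]],
                         r    = cm[idx])
print(high_pairs)
\end{lstlisting}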
\subsubsection{Multicollinearity Check with VIFs}
\begin{itemize}
    \item[] The variance inflation factor (VIF) quantifies the extent of correlation between one predictor and the other predictors in a model. It is used for diagnosing collinearity/multicollinearity. Higher values signify that it is difficult to impossible to assess accurately the contribution of predictors to a model.

    \item[] The variance inflation for a variable is then computed as:
    \begin{itemize}
        \centering
        \large
        \item[] $VIF = \dfrac{1}{1-R^2}$
    \end{itemize}
    \item[] where $R^2$ is the R-squared statistic of the regression where the predictor of interest is predicted by all other predictor variables.

    \item[] A VIF value of 1 means that the predictor is not correlated with other variables.

    \item[] The higher the value, the greater the correlation of the variable with other variables. A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
\end{itemize}
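
The formula $VIF = 1/(1-R^2)$ can be applied directly in base R by regressing each predictor on the others. The sketch below does this for three predictors from the built-in \texttt{mtcars} data (an illustrative assumption). Packaged implementations such as \texttt{vif} in the \texttt{car} package should give the same values for a model with these predictors.

\begin{lstlisting}[language=R]
predictors <- c("wt", "hp", "disp")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress the predictor of interest on all other predictors.
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))

# Predictors exceeding the common threshold of 5.
print(names(vif_manual)[vif_manual > 5])
\end{lstlisting}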


\subsection{Shapiro-Wilk Test for Normality}
\begin{itemize}

    \item[] The Shapiro-Wilk’s test or Shapiro test is a normality test in frequentist statistics. The null hypothesis of Shapiro’s test is that the population is distributed normally. It is among the three tests for normality designed for detecting all kinds of departure from normality.

    \item[] If the $p$-value is less than or equal to 0.05, the Shapiro-Wilk test rejects the hypothesis of normality: at the $5\%$ significance level, there is evidence that the data are not normally distributed. If the $p$-value is greater than 0.05, the test finds no significant departure from normality.

    \item[] \textbf{Shapiro-Wilk's Test Formula}
    \item[] Suppose a sample, say $x_1, x_2,...,x_n$, has come from a normally distributed population and has been sorted in increasing order. The Shapiro-Wilk test statistic is then:
    \begin{itemize}
        \centering
        \large
        \item[] $W =\dfrac{\left(\sum_{i=1}^{n} a_i x_i\right)^2}{\sum_{i=1}^{n}\left(x_i - \overline{X}\right)^2}$
    \end{itemize}
    \item[] where,
    \begin{itemize}
        \item[] $x_i$: the $i^{th}$ smallest number in the given sample.
        \item[] $\overline{X}= \dfrac{x_1+x_2+...+x_n}{n}$: the sample mean.
        \item[] $a_i$: coefficients that can be calculated as $(a_1,a_2,...,a_n)=\dfrac{m^TV^{-1}}{C}$.
        \item[] Here $m=(m_1,m_2,...,m_n)^T$ is the vector of expected values of the order statistics of a sample of size $n$ from the standard normal distribution, $V$ is the covariance matrix of those order statistics, and $C=\| V^{-1}m \|$ is a vector norm.
    \end{itemize}
\end{itemize}
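
In R the test is available as \texttt{shapiro.test} in the base \texttt{stats} package. The sketch below applies it to the residuals of a regression on the built-in \texttt{mtcars} data, which is an assumed example rather than our own model.

\begin{lstlisting}[language=R]
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test on the model residuals.
sw <- shapiro.test(residuals(fit))
print(sw)

# Decision at the 5% significance level.
if (sw$p.value <= 0.05) {
  cat("Reject normality (W =", sw$statistic, ")\n")
} else {
  cat("No significant departure from normality (W =",
      sw$statistic, ")\n")
}
\end{lstlisting}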

\subsection{Residuals vs Leverage}
\begin{itemize}
    \item[] A residuals vs. leverage plot is a type of diagnostic plot that allows us to identify influential observations in a regression model.

    \item[] Each observation from the dataset is shown as a single point within the plot. The x-axis shows the leverage of each point and the y-axis shows the standardized residual of each point.

    \item[] \textbf{Leverage} measures how far the predictor values of an observation lie from those of the other observations, and therefore how much potential the observation has to change the coefficients in the regression model. Observations with high leverage can have a strong influence on the coefficients. If we remove such observations, the coefficients of the model may change noticeably.

    \item[] \textbf{Standardized residuals} refer to the standardized difference between a predicted value for an observation and the actual value of the observation.

    \item[] Let's take a look at the following example:

    \begin{figure}[H]
        \centering
        \includegraphics[height=6cm]{Picture/diagnostic2-768x731.png}
        \caption{An example of the Residuals vs Leverage graph.}
        \label{6.5.1}
    \end{figure}

    \item[] If any point in this plot falls outside of Cook’s distance (the red dashed lines) then it is considered to be an influential observation. In this example, no points fall outside of the dashed lines. This means that \textbf{this regression model does not have any influential points}.

    \item[] On the other hand, suppose we had the following plot:

    \begin{figure}[H]
        \centering
        \includegraphics[height=6cm]{Picture/lev1-768x745.png}
        \caption{The existence of an influential point.}
        \label{6.5.2}
    \end{figure}

    \item[] A quick glance at the graph tells us that observation number 1 in the top right corner falls outside of the red dashed lines. This indicates that it is an influential point.
\end{itemize}
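
A plot of this kind can be produced in base R with \texttt{plot(fit, which = 5)}, and the quantities behind it can be inspected directly. The sketch below uses a model on the built-in \texttt{mtcars} data as an assumed example. The cut-off of $0.5$ for Cook's distance is a rule of thumb matching the inner dashed contour that R draws by default, not a value taken from our analysis.

\begin{lstlisting}[language=R]
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs Leverage diagnostic plot with Cook's distance contours.
plot(fit, which = 5)

lev <- hatvalues(fit)       # leverage of each observation (x-axis)
rs  <- rstandard(fit)       # standardized residuals (y-axis)
cd  <- cooks.distance(fit)  # Cook's distance of each observation

# Observations beyond the assumed rule-of-thumb contour.
influential <- names(cd)[cd > 0.5]
print(influential)

# The five observations with the largest Cook's distance.
print(head(sort(cd, decreasing = TRUE), 5))
\end{lstlisting}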