---
title: "strava_analysis"
author: "Frank Neugebauer"
date: "10/11/2019"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

# Required libraries
library(xlsx)
library(broom)
library(psych)
library("ggplot2")

```

# Working with Strava Data in R

The R work is pretty straightforward. This R Markdown file uses the `all_data.xlsx` file created by the `get_strava_data.r` script. That file must exist before anything in this document will work, and you can't build it using Jupyter Notebook (unless you happen to know how to capture the authorize step).

Note that the analysis focuses on cycling data, a fact that will manifest itself soon enough.

First things first: load the data by creating a vector that specifies the column data types and then using it to open the `all_data.xlsx` file.
```{r echo=TRUE}
# All data
data_types <- c('character', 'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                'numeric', 'character', 'character', 'character', 'numeric',
                'numeric', 'numeric', 'numeric', 'numeric', 'numeric')

strava_data <- read.xlsx("./all_data.xlsx", 1, colClasses = data_types, header=TRUE)
```
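
Before filtering, it can help to confirm that the columns used later in this analysis actually came through the import. The following is a minimal sketch and not part of the original workflow: `check_columns` is a hypothetical helper, and the small `demo` data frame (made-up values) stands in for `strava_data`.

```r
# Columns referenced later in this document
required_cols <- c("type", "gear_id", "distance_mi", "moving_time",
                   "average_speed_mph", "elevation_gain_ft",
                   "average_watts", "average_heartrate")

# Hypothetical helper: returns the required columns missing from a data frame
check_columns <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Stand-in for strava_data (made-up values), deliberately missing two columns
demo <- data.frame(type = "Ride", gear_id = "b1", distance_mi = 20,
                   moving_time = 75, average_speed_mph = 16,
                   average_watts = 150, stringsAsFactors = FALSE)

missing_cols <- check_columns(demo)
print(missing_cols)
```

An empty result from `check_columns(strava_data)` would suggest the spreadsheet has the shape the rest of the analysis assumes.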

## Subset the Data - Part I

Many of my own Strava activities are not cycling (e.g., running, swimming). Furthermore, some of the cycling activities were on stationary bikes without trackable power meters, which means important data is missing for those activities. The next step is to include only the relevant cycling activities.

```{r echo=TRUE}
# Only ride data
ride_data <- subset(strava_data, (type == 'VirtualRide' | type == 'Ride') &
                      average_watts > 0)
```
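
One detail worth knowing about this filter is that `subset()` drops rows whose condition evaluates to `NA`, so rides with a missing `average_watts` value are excluded along with those recorded as zero. The sketch below illustrates this on a made-up `demo` data frame (the values are assumptions for illustration, not real Strava data) and uses `%in%` as an equivalent way to write the activity-type test.

```r
# Made-up activities: some are not rides, some rides lack power data
demo <- data.frame(
  type = c("Ride", "VirtualRide", "Run", "Ride", "Swim", "Ride"),
  average_watts = c(180, 210, NA, 0, NA, NA),
  stringsAsFactors = FALSE
)

# Same logic as the filter above; rows where the condition is NA are dropped
rides <- subset(demo, type %in% c("Ride", "VirtualRide") & average_watts > 0)

print(rides)
print(nrow(rides))
```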

## Some Descriptive Analytics

The following is a reusable function that outputs some important descriptive analytics, including:

* histogram
* Q-Q plot
* a general description with mean, median, max, min, etc.
* a simple kurtosis analysis
* skew analysis

```{r echo=FALSE}
perf_analysis <- function(field_label, data, field, bins) {
  # Histogram on the density scale so the normal curve overlay is comparable
  gg <- ggplot(data, aes(field)) +
     geom_histogram(aes(y = after_stat(density)), bins=bins) +
     stat_function(fun=dnorm, args=list(mean = mean(field, na.rm=TRUE),
                                        sd = sd(field, na.rm=TRUE)),
                   color='black', linewidth=1) +
     ggtitle(field_label) + xlab(field_label) + ylab('Density')
  print(gg)
  # Q-Q plot of the field against the normal distribution
  qq <- ggplot(data, aes(sample = field)) + stat_qq() + stat_qq_line()
  print(qq)

  d <- describe(field)
  print(d)

  # psych::describe reports excess kurtosis, so 0 matches a normal distribution
  kurtosis <- d$kurtosis
  if (round(kurtosis, 2) > 0) {
    print(paste0('Kurtosis is ', round(kurtosis, 2), '. Since it\'s greater than zero, there may be ',
                 'a heavy-tailed distribution. Ideally, this should be zero.'))
  } else if (round(kurtosis, 2) < 0) {
    print(paste0('Kurtosis is ', round(kurtosis, 2), '. Since it\'s less than zero, there may be ',
                 'a light-tailed distribution. Ideally, this should be zero.'))
  } else {
    print('The kurtosis is 0, which is consistent with a normal distribution.')
  }

  skew <- d$skew
  if (round(skew, 2) > 0) {
    print(paste0('Skew is ', round(skew, 2), '. Since it\'s greater than zero, there may be ',
                 'a pile-up of scores on the left of the distribution. Ideally, this should be zero.'))
  } else if (round(skew, 2) < 0) {
    print(paste0('Skew is ', round(skew, 2), '. Since it\'s less than zero, there may be ',
                 'a pile-up of scores on the right of the distribution. Ideally, this should be zero.'))
  } else {
    print('The skew is 0, which is consistent with a symmetric distribution.')
  }
}
```
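
To make the skew and kurtosis checks less of a black box, here is a minimal base R sketch of the plain moment-based definitions. The helper names `moment_skew` and `excess_kurtosis` and the sample vectors are assumptions for illustration; `psych::describe` applies its own estimator variant (selected through its `type` argument), so its numbers may differ slightly from these.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# (subtracting 3 makes a normal distribution score zero)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  dev <- x - mean(x)
  mean(dev^4) / mean(dev^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)    # evenly spread around the mean
right_tailed <- c(1, 1, 1, 2, 10)   # most values low, one large value

print(moment_skew(symmetric))       # zero up to floating-point error
print(moment_skew(right_tailed))    # positive: scores pile up on the left
print(excess_kurtosis(symmetric))   # negative: lighter tails than a normal
```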

Call the function on key data elements.

```{r echo=TRUE}
a <- perf_analysis('Average Speed (MPH)', ride_data, ride_data$average_speed_mph, 15)
a <- perf_analysis('Distance (Miles)', ride_data, ride_data$distance_mi, 15)
a <- perf_analysis('Moving Time (Minutes)', ride_data, ride_data$moving_time, 15)
a <- perf_analysis('Elevation Gain (Feet)', ride_data, ride_data$elevation_gain_ft, 15)
a <- perf_analysis('Average Power (Watts)', ride_data, ride_data$average_watts, 15)
a <- perf_analysis('Average Heart Rate', ride_data, ride_data$average_heartrate, 15)
```

## More Analytics

The next couple of steps output a scatter plot and a box plot of distance and speed, along with the gear (i.e., bike) used.

```{r echo=TRUE}
# color must be mapped inside aes() to take effect
ggplot(ride_data, aes(x = distance_mi, y = average_speed_mph,
                      color = factor(gear_id))) +
      geom_point()

# boxplot with bikes
ggplot(data = ride_data, aes(x = factor(gear_id), y = average_speed_mph)) +
        geom_boxplot()
```
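
The box plot can be complemented with a numeric summary per bike. The sketch below uses base R's `aggregate()` on a made-up `demo` data frame (the gear IDs and speeds are assumptions for illustration) to compute the median speed for each gear, which corresponds to the center line of each box.

```r
# Made-up rides on two bikes
demo <- data.frame(
  gear_id = c("b1", "b1", "b1", "b2", "b2", "b2"),
  average_speed_mph = c(15, 16, 17, 19, 20, 24)
)

# Median speed per gear, one row per gear_id
gear_medians <- aggregate(average_speed_mph ~ gear_id, data = demo, FUN = median)
print(gear_medians)
```

Replacing `demo` with `ride_data` would give the same summary for the real rides.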

## Linear Model

A simple linear model is created to show the relationship between average speed and distance. Intuitively, speed should (on average) go down as distance goes up.

```{r echo=TRUE}
# Simple lm model - how distance affects speed
lm_speed_dist_gear_id <- lm(average_speed_mph ~ distance_mi + factor(gear_id),
                            data = ride_data)
summary(lm_speed_dist_gear_id)
lm_speed_dist_gear_id
```

The linear model shows that distance does not affect speed as expected: speed goes up with distance. The linear model also shows that the gear has a greater impact, positive or negative depending on the gear. This makes sense because a time trial bike will almost always increase speed, whereas a fat tire mountain bike generally slows it.
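
To show how the coefficients of such a model are read, here is a minimal sketch on made-up data where the true relationship is known. The gear IDs, distances, and the formula used to generate the speeds are all assumptions for illustration; they are not taken from the Strava data.

```r
# Made-up rides: speed = 15 + 0.05 * distance, plus 3 mph on bike "b2"
demo <- data.frame(
  gear_id = rep(c("b1", "b2"), each = 4),
  distance_mi = rep(c(10, 20, 30, 40), times = 2)
)
demo$average_speed_mph <- 15 + 0.05 * demo$distance_mi +
  ifelse(demo$gear_id == "b2", 3, 0)

fit <- lm(average_speed_mph ~ distance_mi + factor(gear_id), data = demo)
fit_coefs <- coef(fit)
print(fit_coefs)

# (Intercept)        : speed of the baseline gear ("b1") at zero distance
# distance_mi        : change in speed per extra mile, shared by all gears
# factor(gear_id)b2  : offset of "b2" relative to the baseline gear
```

Each `factor(gear_id)` coefficient is an offset from the first gear level, so a negative value means that bike is slower than the baseline bike at any given distance.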

Note that other factors (e.g., type of terrain) are not adequately accounted for. Power is probably a better measure overall.

## Parallel Slopes

Here we show how the categorical variable (gear) affects the distance-to-speed linear model.
```{r echo=TRUE}
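# try parallel slopes
# Augment the model with fitted values
augmented_bikes <- augment(lm_speed_dist_gear_id)

# The gear column is named `factor.gear_id.` or `factor(gear_id)` depending on
# the broom version, so look it up by pattern and copy it to a stable name
gear_col <- grep("gear_id", names(augmented_bikes), value = TRUE)[1]
augmented_bikes$gear <- augmented_bikes[[gear_col]]

# scatterplot, with color
lm_plot <- ggplot(augmented_bikes, aes(x = distance_mi, y = average_speed_mph,
                                       color = gear)) +
 geom_point()

# single call to geom_line() draws one fitted line per gear; keep the result
lm_plot <- lm_plot + geom_line(aes(y = .fitted))
print(lm_plot)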
```
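
The "parallel" in parallel slopes means every gear shares the same distance slope and differs only by a constant vertical offset. The sketch below checks this with `predict()` on made-up data (gear IDs, distances, and speeds are assumptions for illustration): the gap between the two fitted lines is the same at a short and a long distance.

```r
# Made-up rides on two bikes, with a little deterministic noise
demo <- data.frame(
  gear_id = rep(c("b1", "b2"), each = 4),
  distance_mi = rep(c(10, 20, 30, 40), times = 2)
)
demo$average_speed_mph <- 15 + 0.05 * demo$distance_mi +
  ifelse(demo$gear_id == "b2", 3, 0) +
  rep(c(0.2, -0.2, -0.2, 0.2), times = 2)

fit <- lm(average_speed_mph ~ distance_mi + factor(gear_id), data = demo)

# Predict both gears at a short and a long distance
# Row order: (10, b1), (40, b1), (10, b2), (40, b2)
grid <- expand.grid(distance_mi = c(10, 40), gear_id = c("b1", "b2"),
                    stringsAsFactors = FALSE)
pred <- unname(predict(fit, newdata = grid))

# Gap between the b2 and b1 lines at each distance
gap <- pred[3:4] - pred[1:2]
print(gap)
```

Because the model has no interaction term between distance and gear, the two gaps are equal; adding `distance_mi:factor(gear_id)` to the formula would let each gear have its own slope.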