---
title: 'Auto Accidents Toronto: Case Study'
author: "Mark Edney"
date: '2021-08-09'
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
## Introduction
This is a custom case study for the Capstone Project of the Google Data Analytics
course on Coursera. It outlines the steps taken to define and solve a business task
using a dataset.

## Ask

The business task is to research traffic collisions for a tow truck company
in Toronto. Where should tow trucks be stationed to respond most quickly to calls
at the most common collision sites? What are the best times to have employees working?
Are there times of the year when it would be a good idea to hire contractors to
meet increased demand?

The main stakeholders are the owners of the tow truck company requesting the data analysis.
An additional stakeholder is the police department, as they are the owners of the dataset.

The dataset is the automotive traffic collisions dataset
for Toronto from the Open Data Toronto site, found [here](https://open.toronto.ca/dataset/police-annual-statistical-report-traffic-collisions/).

## Prepare

The following short script downloads the automotive accident data from
the Open Data Toronto portal. The data is then saved as an RDS file. RDS is R's
own serialization format, so the file cannot easily be opened outside R, which is a
major disadvantage, but it is compressed by default and is typically much smaller
than the equivalent csv file.
```{r}
library(opendatatoronto)
library(tidyverse)
library(lubridate)

# Download the collisions package from Open Data Toronto by its package ID
auto.data <- show_package("ec53f7b2-769b-4914-91fe-a37ee27a90b3") %>%
  list_package_resources() %>%
  get_resource() %>%
  as_tibble()

# Save a local copy, then work from the saved file
saveRDS(auto.data, "Auto_accidents_toronto.RDS")
auto.data <- readRDS("Auto_accidents_toronto.RDS")
```
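As written, the chunk above downloads the data on every knit before saving and re-reading it. A minimal sketch of a cached load is shown below. `load_cached()` and its `fetch` argument are hypothetical helpers introduced for illustration and are not part of opendatatoronto; in this report, `fetch` would wrap the download pipeline and `path` would be the RDS file name. A toy function stands in for the download so the sketch runs offline.

```r
# Hypothetical helper: read the RDS file if it exists, otherwise call
# fetch() once and save its result at `path` before reading it back.
load_cached <- function(path, fetch) {
  if (!file.exists(path)) {
    saveRDS(fetch(), path)
  }
  readRDS(path)
}

# Toy stand-in for the download step (invented values).
toy_fetch <- function() data.frame(Hour = c(8, 17), Injury = c("NO", "YES"))

cache_file <- tempfile(fileext = ".RDS")
first  <- load_cached(cache_file, toy_fetch)  # calls toy_fetch() and writes the file
second <- load_cached(cache_file, toy_fetch)  # file exists, so it is only read
```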
A convenient way to inspect the data is the 'glimpse' function, which lists each
column with its type and first few values.
```{r}
glimpse(auto.data)
```
## Process
Some errors are observed from the data import. Some of the columns can be dropped
as they don't add anything to the analysis.
```{r drop}
auto.data <- auto.data %>%
  select(-c(OBJECTID, EventUniqueId, Division, Atom, ObjectId2, geometry))
```
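If the portal changes its schema, `select(-c(...))` stops with an error when a listed column is missing. A hedged alternative is tidyselect's `any_of()`, which ignores names that are absent. The sketch below uses a toy tibble with invented values, not the real data.

```r
library(dplyr)

# Toy stand-in that contains only some of the columns listed in the report.
toy <- tibble(OBJECTID = 1:2, Division = c("D11", "D14"), Hour = c(8, 17))

drop_cols <- c("OBJECTID", "EventUniqueId", "Division", "Atom", "ObjectId2", "geometry")

# any_of() drops the listed columns that exist and silently skips the rest.
toy_clean <- toy %>% select(-any_of(drop_cols))
names(toy_clean)
```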
Some of the column names, such as FTR_Collisions and PD_Collisions, are not very descriptive.
According to the metadata descriptions, these columns represent "Failure to remain at the scene"
and "Property Damage" respectively. Injury_Collisions is shortened to Injury to match.
```{r rename}
auto.data <- auto.data %>%
  rename(Left_scene = FTR_Collisions,
         Property_Damage = PD_Collisions,
         Injury = Injury_Collisions)
```
The three renamed columns only take yes/no values, so they are better
represented as factors. Day_of_Week is also converted to a factor with its levels
in calendar order.
```{r factor}
auto.data <- auto.data %>%
  mutate(
    # Yes/no indicators become factors
    Left_scene = as.factor(Left_scene),
    Property_Damage = as.factor(Property_Damage),
    Injury = as.factor(Injury),
    # Fix the weekday order so plots run Monday to Sunday
    Day_of_Week = factor(Day_of_Week,
                         levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))
  )
```
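The explicit `levels` argument matters because R otherwise orders factor levels alphabetically, which would put Friday first on any plot axis. A small illustration with made-up values:

```r
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Made-up sample of weekday labels.
x <- c("Friday", "Monday", "Sunday", "Monday")

levels(factor(x))                 # alphabetical order of the values present
levels(factor(x, levels = days))  # calendar order, including unused days

# table() follows the level order, so counts come out Monday to Sunday.
day_counts <- table(factor(x, levels = days))
```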
## Analyze
### Neighbourhoods
The first step is to identify the neighbourhoods with the highest number of collisions.
The neighbourhood is used rather than the GPS coordinates as it groups the
results into a manageable number of areas.
```{r locations}
# Count collisions per neighbourhood, largest first
Location <- auto.data %>%
  count(Neighbourhood) %>%
  arrange(desc(n))
```
The top 10 neighbourhoods for collisions can then be plotted:
```{r locationplot}
ggplot(Location[1:10,], aes(y = n, x = reorder(Neighbourhood, -n))) +
  geom_col() +
  coord_flip()
```
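Taking the first ten rows of Location relies on the table having been sorted beforehand. As a sketch of an alternative, dplyr's `slice_max()` picks the top rows directly, whatever the input order. The counts below are invented for illustration.

```r
library(dplyr)

# Invented counts, deliberately unsorted.
toy_location <- tibble(
  Neighbourhood = c("A", "B", "C", "D"),
  n = c(5, 40, 12, 7)
)

# Keep the 2 rows with the largest n; the result is sorted largest first.
top2 <- toy_location %>% slice_max(order_by = n, n = 2)
top2$Neighbourhood
```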
With the neighbourhoods with the most collisions identified, it may be useful
to map them. There are many different ways to locate the neighbourhoods, and there may
even be shape files outlining their borders, but the easiest way is to simply
take the average of the Latitude and Longitude values grouped by neighbourhood.
This data frame can then be joined with the previous summary to add the count values.
```{r area_gps}
Area <- auto.data %>%
  group_by(Neighbourhood) %>%
  summarise(Longitude = mean(Longitude), Latitude = mean(Latitude)) %>%
  left_join(Location, by = "Neighbourhood")
```
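A rough map can be sketched straight from these averaged coordinates by plotting one point per neighbourhood and scaling it by the collision count. The coordinates and counts below are invented, and a real map would want a basemap or the neighbourhood boundaries underneath.

```r
library(ggplot2)

# Invented centroids and counts standing in for the Area data frame.
toy_area <- data.frame(
  Neighbourhood = c("A", "B", "C"),
  Longitude = c(-79.40, -79.38, -79.45),
  Latitude = c(43.64, 43.66, 43.70),
  n = c(120, 80, 30)
)

# One point per neighbourhood; size and colour both encode the count.
area_map <- ggplot(toy_area, aes(x = Longitude, y = Latitude, size = n, colour = n)) +
  geom_point(alpha = 0.7) +
  coord_quickmap() +
  labs(size = "Collisions", colour = "Collisions")
```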
### Dates
Weekly trends can be observed by plotting the collision count for each day of the
week. Filling the bars by hour shows any trend with the hour of the day on the same
chart.
```{r date}
auto.data %>%
  ggplot(aes(x = Day_of_Week, fill = factor(Hour))) +
  geom_bar() +
  coord_flip()
```
There is no clear trend in the previous plot as there are too many hour values
to distinguish within each day. The time of day should be examined separately.
```{r time}
auto.data %>%
  ggplot(aes(x = Hour)) +
  geom_bar() +
  coord_flip()
```
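The day and the hour can still be viewed together with a heatmap of counts, which replaces 24 stacked fill colours with a single colour scale. The sketch below uses simulated data; the column names `Day_of_Week` and `Hour` follow the report, but the values are random and carry no meaning.

```r
library(dplyr)
library(ggplot2)

days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Simulated collisions: a random day and hour for 500 events.
set.seed(1)
toy <- tibble(
  Day_of_Week = factor(sample(days, 500, replace = TRUE), levels = days),
  Hour = sample(0:23, 500, replace = TRUE)
)

# One row per observed day/hour combination with its count.
day_hour <- toy %>% count(Day_of_Week, Hour)

# Tile colour encodes the number of collisions in each day/hour cell.
heat <- ggplot(day_hour, aes(x = Hour, y = Day_of_Week, fill = n)) +
  geom_tile() +
  labs(y = "Day of week", fill = "Collisions")
```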
Plotting the frequency of collisions by month would show any seasonal variation, such
as the commonly held belief that collisions increase during snowy months.
```{r season}
auto.data %>%
  ggplot(aes(x = reorder(Month, -month(OccurrenceDate)), fill = Month)) +
  geom_bar() +
  labs(x = "Months") +
  coord_flip()
```
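The `reorder()` call above sorts the month names by their month number. A sketch of an alternative, assuming OccurrenceDate parses as a date, is to derive the month with lubridate's `month(label = TRUE)`, which already returns an ordered factor running from January to December. The dates below are invented.

```r
library(lubridate)

# Invented occurrence dates.
dates <- as.Date(c("2019-01-15", "2019-01-20", "2019-07-04", "2019-12-31"))

# An ordered factor with 12 levels in calendar order (labels follow the locale).
m <- month(dates, label = TRUE, abbr = FALSE)

# Counts per month, including months with no collisions.
month_counts <- table(m)
```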
### Chance of injuries
The last thing to look at is the percentage of collisions in which each outcome
occurs: an injury, a failure to remain at the scene, and property damage.
```{r injury}
mean(auto.data$Injury=="YES")*100
mean(auto.data$Left_scene=="YES")*100
mean(auto.data$Property_Damage=="YES")*100
```
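Each line above works because comparing a factor with "YES" gives a logical vector, and the mean of a logical vector is the share of TRUE values. As a sketch, the same three percentages can be computed in one step with dplyr's `across()`. The outcomes below are invented.

```r
library(dplyr)

# Invented outcomes for five collisions.
toy <- tibble(
  Injury = factor(c("NO", "YES", "NO", "NO", "NO")),
  Left_scene = factor(c("NO", "NO", "YES", "YES", "NO")),
  Property_Damage = factor(c("YES", "YES", "YES", "NO", "YES"))
)

# For every column: share of "YES" values, expressed as a percentage.
pct <- toy %>%
  summarise(across(everything(), ~ mean(.x == "YES") * 100))
```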
## Share
The visuals for the case study are included in the Analyze stage of this report, and
a Tableau dashboard is also available [here](https://public.tableau.com/app/profile/mark.edney/viz/Collisions_Toronto_CS/Dashboard1).

## Act

From the analysis provided in this report, and the visuals that support it, the
best advice for the client is to focus on the regions with the most
collisions, such as the waterfront communities. Collisions occur
most often on weekdays, ramping up to Friday. It would be best to operate
between the hours of 8 am and 7 pm. There doesn't seem to be any seasonal variation,
so there is no need to contract employees for the winter season. There is a small
chance of injury in each collision, but it may be a good idea for each tow truck
driver to complete some first aid training.

Going further, I would look into locations that have access to highways, as that would
give tow truck drivers easier access to a large area of accidents. With that in mind,
it may be beneficial to include the average speed limit for each neighbourhood.