---
title: "Case Study: How Can a Wellness Technology Company Play It Smart?"
author: "Mauricio Emilio Monroy González"
date: "December 18, 2024"
output: html_document
---

### Contents

1.  [Summary](#summary)\

2.  [Ask Phase: Business task](#ask-phase)\

3.  [Prepare Phase: Data Sources](#prepare-phase)\

4.  [Process Phase: Data Manipulation Documentation](#process-phase)\
    4.1 [Library and dataframe import](#library-and-dataframe-import)\
    4.2 [Cleaning](#cleaning)\

5.  [Analyze and Share Phase](#analyze-and-share-phase)\
    5.1 [Trending Measurement Features](#trending-measurement-features)\
    5.2 [User Demographics and Patterns](#user-demographics-and-patterns)\
    5.3 [Correlation Between Variables](#correlation-between-variables)\
    5.4 [Consistency in Use of Smart-Device](#consistency-in-use-of-smart-device)\

6.  [Act Phase](#act-phase)\

7.  [References](#references)\

# Summary {#summary}

Based on the data analyzed, Bellabeat could gain ground in the health-monitoring industry by innovating in smart-device tracking, taking advantage of usage trends, and pursuing opportunities with users and technology alike. For this report, data was taken from a FitBit user log spanning two months of recordings from over 30 users across different measurement categories. In RStudio, the data was cleaned across multiple data frames, analyzed, and visualized, so that Bellabeat has well-founded recommendations for accomplishing its business task.

# Ask Phase {#ask-phase}

*Business Task*

Bellabeat is a small but successful high-tech company that manufactures health-focused smart products aimed particularly at women. With points of sale (POS) in both e-commerce and physical retail, and with the market trending towards smart health monitoring, the company’s leadership believes new growth opportunities can be achieved by better understanding market data and making data-driven decisions. By identifying trends in smart-device usage and extrapolating them to Bellabeat’s own customers, the company could adjust its product development and marketing strategies to better appeal to its customers and meet their needs. Bellabeat would thereby gain a competitive advantage by aligning what and how it innovates with what the market demands. Fitness trackers and smartwatches now include features like heart rate monitoring, sleep tracking, ECG capabilities, and blood oxygenation, and research is being conducted to also include blood sugar measurements. The following general statistics give a big-picture view of the favorable environment in which Bellabeat could thrive:

-   The global wearable technology market in health and fitness was projected to reach \$60 billion USD by 2023 (Psico-Smart Editorial Team, 2024)

-   Almost 1 in 3 Americans use a wearable device to track their health and fitness (NIH, 2023)

-   In 2023, 75% of healthcare providers considered that patients using wearables displayed better engagement with their fitness goals (Psico-Smart Editorial Team, 2024)

# Prepare Phase {#prepare-phase}

*Data Sources Description*

To carry out the analysis required to help Bellabeat with their growth-oriented business task, the “FitBit Fitness Tracker Data” were used.

| Description | FitBit Fitness Tracker Data |
|------------------------|------------------------------------------------|
| Data Storage Location | Kaggle |
| Data Organization | 18 .csv files: "dailyActivity", "dailyCalories", "dailyIntensities", "dailySteps", "heartrate_seconds", "hourlyCalories", "hourlyIntensities", "hourlySteps", "minuteCaloriesNarrow", "minuteCaloriesWide", "minuteIntensitiesNarrow", "minuteIntensitiesWide", "minuteMETsNarrow", "minuteSleep", "minuteStepsNarrow", "minuteStepsWide", "sleepDay", and "weightLogInfo" (some are separated per month in 2 separate downloaded folders) |
| Issues with bias or credibility | Based on only 30 user data outputs. “Variation between output represents use of different types of Fitbit trackers and individual tracking behaviours / preferences” (Kaggle, 2024) |
| ROCCC Criterion (Reliable/Original/Comprehensive/Current/Cited) | Not reliable due to sampling biases / Original primary source data / Comprehensive in scope of categories included / Cited -- 5/6 |
| Licensing, privacy, security, accessibility | Dataset CC0: Public Domain. “Generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016” (Kaggle, 2024) |
| Helpfulness in answering guiding question | This dataset provides multiple possible measurements obtainable from smart trackers, and also points towards activity use, how the gadgets are used, and the information that users are interested in |
| Are there problems with the data? | The sample from the dataset compared to the entire population to be considered is not really representative, and the dataset does not come pre-cleaned or merged between months. Also, there are csv pairs that are not complete. |

# Process Phase {#process-phase}

*Data Manipulation Documentation*

## Library and dataframe import {#library-and-dataframe-import}

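The libraries used throughout this process were tidyverse, lubridate, janitor, skimr, tidyr, dplyr, ggplot2, and knitr (tidyr, dplyr, and ggplot2 are already attached by the tidyverse, so loading them separately is redundant but harmless).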

```{r setup, echo=TRUE}
# Libraries loaded
library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
# Path set
opts_knit$set(root.dir = "C:/Users/mauri/OneDrive/Documents/Personal/GoogleDataAnalytics/Bellabeat Case Study")
```

The dataset files are then imported into dataframes. The dataframe for the first month carries the suffix "_1" and the one for the second month the suffix "_2"; the two are afterwards merged into a single dataframe. Only the categories for which both monthly files exist are used, so that each category covers the whole period with as much data as possible. Also, where a category has complete pairs at several time resolutions, the one with the most measurements is used, in order not to get lost in details. That being said, the ones used will be:

-   Daily Activity

```{r dailyActivityImport&Merge, echo = TRUE}
daiAct_1 <- read.csv("archive/merged/dailyActivity_merged.csv")
daiAct_2 <- read.csv("archive/merged/dailyActivity_merged2.csv")
dailyActivity <- rbind(daiAct_1, daiAct_2)
skim_without_charts(dailyActivity)
```

-   Heart Rate by Seconds

```{r heartRate, echo = TRUE}
heartR_1 <- read.csv("archive/merged/heartrate_seconds_merged.csv")
heartR_2 <- read.csv("archive/merged/heartrate_seconds_merged2.csv")
heartRate <- rbind(heartR_1, heartR_2)
skim_without_charts(heartRate)
```

-   Hourly Calories

```{r hourlyCaloriesImport&Merge, echo = TRUE}
hourCal_1 <- read.csv("archive/merged/hourlyCalories_merged.csv")
hourCal_2 <- read.csv("archive/merged/hourlyCalories_merged2.csv")
hourlyCalories <- rbind(hourCal_1, hourCal_2)
skim_without_charts(hourlyCalories)
```

-   Hourly Intensities

```{r hourlyIntensitiesImport&Merge, echo = TRUE}
hourInt_1 <- read.csv("archive/merged/hourlyIntensities_merged.csv")
hourInt_2 <- read.csv("archive/merged/hourlyIntensities_merged2.csv")
hourlyIntensities <- rbind(hourInt_1, hourInt_2)
skim_without_charts(hourlyIntensities)
```

-   Hourly Steps

```{r hourlyStepsImport&Merge, echo = TRUE}
hourSteps_1 <- read.csv("archive/merged/hourlySteps_merged.csv")
hourSteps_2 <- read.csv("archive/merged/hourlySteps_merged2.csv")
hourlySteps <- rbind(hourSteps_1, hourSteps_2)
skim_without_charts(hourlySteps)
```

-   METs (metabolic equivalents) by minute

```{r minuteMetImport&Merge, echo = TRUE}
minMet_1 <- read.csv("archive/merged/minuteMETsNarrow_merged.csv")
minMet_2 <- read.csv("archive/merged/minuteMETsNarrow_merged2.csv")
minuteMET <- rbind(minMet_1, minMet_2)
skim_without_charts(minuteMET)
```

-   Sleep by minute

```{r minSleepImport&Merge, echo = TRUE}
minSleep_1 <- read.csv("archive/merged/minuteSleep_merged.csv")
minSleep_2 <- read.csv("archive/merged/minuteSleep_merged2.csv")
minuteSleep <- rbind(minSleep_1, minSleep_2)
skim_without_charts(minuteSleep)
```

-   Weight Log information

```{r weightLogImport&Merge, echo = TRUE}
weightLog_1 <- read.csv("archive/merged/weightLogInfo_merged.csv")
weightLog_2 <- read.csv("archive/merged/weightLogInfo_merged2.csv")
weightLog <- rbind(weightLog_1, weightLog_2)
skim_without_charts(weightLog)
```
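Because every category follows the same read-twice-then-`rbind()` pattern, the import could also be wrapped in a small helper. The following is only a sketch in base R: `read_pair()` is a hypothetical name that does not appear in the original analysis, and the demonstration writes two tiny made-up files to a temporary folder instead of touching the Kaggle data.

```r
# Hypothetical helper (not part of the original analysis): read the two
# monthly exports of one category and stack them into one data frame.
read_pair <- function(dir, stem) {
  paths <- file.path(dir, paste0(stem, c("_merged.csv", "_merged2.csv")))
  stopifnot(all(file.exists(paths)))  # fail early if one month is missing
  do.call(rbind, lapply(paths, read.csv))
}

# Toy demonstration with made-up values written to a temporary folder
demo_dir <- tempfile("fitbit_demo")
dir.create(demo_dir)
write.csv(data.frame(Id = c(1, 2), Calories = c(100, 200)),
          file.path(demo_dir, "hourlyCalories_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(2, 3), Calories = c(200, 300)),
          file.path(demo_dir, "hourlyCalories_merged2.csv"), row.names = FALSE)

demo <- read_pair(demo_dir, "hourlyCalories")
nrow(demo)  # rows from both months, duplicates still included
```

With such a helper, a category whose second-month file is missing stops with an error instead of silently producing a one-month dataframe.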

## Cleaning {#cleaning}

Glancing over the datasets, there were various noteworthy points to mention. For starters, the Daily Activity set contains 457 rows, but only 30 survey respondents, so there must be an unequal distribution of participation and usage of the FitBit trackers. According to the function *skim_without_charts* from the skimr package, the majority of the "Fat" column from the WeightLog dataset provided empty values. To view the existence and amount of duplicate values:

```{r dupeCheck, echo = TRUE}
# Duplicate rows
print("Duplicate rows per Dataframe")
sum(duplicated(dailyActivity))
sum(duplicated(heartRate))
sum(duplicated(hourlyCalories))
sum(duplicated(hourlyIntensities))
sum(duplicated(hourlySteps))
sum(duplicated(minuteMET))
sum(duplicated(minuteSleep))
sum(duplicated(weightLog))
# Unique ID count
print("Unique IDs per Dataframe")
n_unique(dailyActivity$Id)
n_unique(heartRate$Id)
n_unique(hourlyCalories$Id)
n_unique(hourlyIntensities$Id)
n_unique(hourlySteps$Id)
n_unique(minuteMET$Id)
n_unique(minuteSleep$Id)
n_unique(weightLog$Id)
```

Apparently, there were lots of duplicates (likely stemming from overlapping data between the two monthly files that were merged). Also, not all user IDs provided data for the same categories. To clean the dataframes, column names were standardized to lowercase with underscores, and duplicate rows as well as rows with missing values were removed. For added control, the entries were also sorted by their date column; since that column is still stored as text at this point, the ordering is lexicographic rather than chronological until the dates are parsed in the next step.

```{r dataCleaning, echo=TRUE}
dailyActivity <- dailyActivity %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(activity_date)
heartRate <- heartRate %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(time)
hourlyCalories <- hourlyCalories %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(activity_hour)
hourlyIntensities <- hourlyIntensities %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(activity_hour)
hourlySteps <- hourlySteps %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(activity_hour)
minuteMET <- minuteMET %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(activity_minute)
minuteSleep <- minuteSleep %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(date)
weightLog <- weightLog %>%
  clean_names() %>%
  distinct() %>%
  drop_na() %>%
  arrange(date)
```
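One side effect of the cleaning step may be worth keeping in mind: `drop_na()` called without arguments removes a row if any column is missing, so a mostly empty column such as "Fat" can remove most of the weight log. The sketch below illustrates the difference with base R's `complete.cases()` as a stand-in; the toy values and the choice of "required" columns are assumptions made for illustration only.

```r
# Made-up weight log: only the first row has a body-fat value
weight_toy <- data.frame(
  id        = c(1, 1, 2),
  weight_kg = c(52.6, 52.4, 133.5),
  fat       = c(22, NA, NA)
)

# Strict rule: every column must be present (what a bare drop_na() does)
strict <- weight_toy[complete.cases(weight_toy), ]

# Lenient rule: only the columns needed for the analysis must be present
required <- c("id", "weight_kg")  # illustrative choice, not from the report
lenient <- weight_toy[complete.cases(weight_toy[, required]), ]

c(strict = nrow(strict), lenient = nrow(lenient))
c(strict = length(unique(strict$id)), lenient = length(unique(lenient$id)))
```

Under the strict rule, the number of distinct users left in a dataframe can shrink as well, which matters when user counts per category are compared later on.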

To finish cleaning the dataframes, the date columns were converted from text into proper date and date-time types using the *lubridate* package with the following script:

```{r dateChange, echo = TRUE}
dailyActivity <- dailyActivity %>%
  mutate(activity_date = mdy(activity_date))
heartRate <- heartRate %>%
  mutate(time = mdy_hms(time))
hourlyCalories <- hourlyCalories %>%
  mutate(activity_hour = mdy_hms(activity_hour))
hourlyIntensities <- hourlyIntensities %>%
  mutate(activity_hour = mdy_hms(activity_hour))
hourlySteps <- hourlySteps %>%
  mutate(activity_hour = mdy_hms(activity_hour))
minuteMET <- minuteMET %>%
  mutate(activity_minute = mdy_hms(activity_minute))
minuteSleep <- minuteSleep %>%
  mutate(date = mdy_hms(date))
weightLog <- weightLog %>%
  mutate(date = mdy_hms(date))
```
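The reason for parsing the dates, rather than leaving them as text, can be seen in a small base R sketch: text dates sort character by character, whereas parsed dates sort chronologically. The four dates below are made up, and `as.Date()` with an explicit format is used here instead of lubridate so that the example runs without extra packages.

```r
# Made-up dates in the month/day/year text layout used by the csv files
raw_dates <- c("4/12/2016", "4/2/2016", "5/1/2016", "4/9/2016")

# Sorting the text compares characters, so "4/12" lands before "4/2"
text_order <- sort(raw_dates, method = "radix")

# Parsing first gives a true chronological order
parsed     <- as.Date(raw_dates, format = "%m/%d/%Y")
date_order <- sort(parsed)

text_order
format(date_order)
```

Any step that relies on row order (for example plotting a time series with lines) is therefore safer after the conversion than before it.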

# Analyze and Share Phase {#analyze-and-share-phase}

## Trending Measurement Features {#trending-measurement-features}

Following on from the earlier observation that user IDs are unevenly distributed among the dataframes, an informed inference can be made about which metrics users are interested in and how they use their FitBit equipment. To that end, the number of distinct IDs per dataframe was computed:

```{r idnumber, echo= TRUE}
usersDailyActivity <- n_distinct(dailyActivity$id)
usersHeartRate <- n_distinct(heartRate$id)
usersHourlyCalories <- n_distinct(hourlyCalories$id)
usersHourlyIntensities <- n_distinct(hourlyIntensities$id)
usersHourlySteps <- n_distinct(hourlySteps$id)
usersMinuteMET <- n_distinct(minuteMET$id)
usersMinuteSleep <- n_distinct(minuteSleep$id)
usersWeightLog <- n_distinct(weightLog$id)
distinctUsers <- data.frame(category = c("Daily Activity", "HeartRate", "Hourly Calories", "Hourly Intensities", "Hourly Steps", "Minute MET", "Minute Sleep", "Weight Log"), users = c(usersDailyActivity, usersHeartRate, usersHourlyCalories, usersHourlyIntensities, usersHourlySteps, usersMinuteMET, usersMinuteSleep, usersWeightLog))
ggplot(distinctUsers, aes(x = category, y = users)) + geom_bar(stat = "identity", fill = "lightblue") + labs(title = "Number of Distinct Users per Category", x = "Category", y = "Number of Users") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Apparently, not all users track their Heart Rate or Sleeping Trends, and even fewer track their Weight Logs. For Bellabeat, this is both a pro and a con: these are market segments that could grow, but they currently do not get enough attention from consumers. As such, consumer interest appears concentrated on Activity Monitoring rather than on overall health.

## User Demographics and Patterns {#user-demographics-and-patterns}

Having identified user interest in lifestyle and activity, the next questions are "who are these users?" and "what do their day-to-day activities consist of?". Not much demographic information is available for privacy reasons, so a different approach was taken. To define five equal-width bands by which to group step counts, the maximum daily step count was found using:

```{r maxsteps, echo = TRUE}
maxSteps <- max(dailyActivity$total_steps)
set <- maxSteps / 5
```

Five categories will be defined:

-   Sedentary: 0-7203 steps
-   Sedentary-Active: 7203-14406 steps
-   Active: 14406-21609 steps
-   Active-Athletic: 21609-28812 steps
-   Athletic: +28812 steps

```{r stepgrouping, echo=TRUE}
setAthletic <- dailyActivity %>%
  filter(total_steps > 28812) %>%
  summarise(count = n())
setActiveAthletic <- dailyActivity %>%
  filter(21609 < total_steps & total_steps < 28812) %>%
  summarise(count = n())
setActive <- dailyActivity %>%
  filter(14406 < total_steps & total_steps < 21609) %>%
  summarise(count = n())
setSedentaryActive <- dailyActivity %>%
  filter(7203 < total_steps & total_steps < 14406) %>%
  summarise(count = n())
setSedentary <- dailyActivity %>%
  filter(0 < total_steps & total_steps < 7203) %>%
  summarise(count = n())
stepSet <- data.frame(
  category = c("Sedentary", "Sedentary-Active", "Active", "Active-Athletic", "Athletic"),
  user_count = c(setSedentary$count, setSedentaryActive$count, setActive$count, setActiveAthletic$count, setAthletic$count)
)
ggplot(stepSet, aes(x = "", y = user_count, fill = category)) + geom_bar(width = 1, stat = "identity") + coord_polar(theta = "y") + labs(title = "Proportion of User Group Activity Levels Based on Steps", x = "", y = "") + theme_minimal() + theme(legend.position = "right") + scale_fill_brewer(palette = "Pastel1") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
totalUsers <- nrow(dailyActivity)
stepSet <- stepSet %>%
  mutate(percentage = 100*(user_count/totalUsers))

```
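As an alternative sketch of the same grouping, base R's `cut()` can assign every record to a band in one call. The step values below are made up; the break points are the ones listed above. Note one deliberate difference, which is an assumption of this sketch rather than the report's method: with `right = FALSE` the intervals are closed on the left, so zero-step days and values that fall exactly on a boundary are kept instead of being excluded by strict `<` comparisons.

```r
# Made-up daily step counts, including 0 and an exact boundary value (7203)
steps <- c(0, 3500, 7203, 9000, 15000, 22000, 36019)

breaks <- c(0, 7203, 14406, 21609, 28812, Inf)
labels <- c("Sedentary", "Sedentary-Active", "Active",
            "Active-Athletic", "Athletic")

# right = FALSE gives intervals [0, 7203), [7203, 14406), ..., [28812, Inf)
level <- cut(steps, breaks = breaks, labels = labels, right = FALSE)

counts <- table(level)                      # records per band
share  <- round(100 * prop.table(counts), 1)  # percentage per band
counts
share
```

Because every record lands in exactly one band, the percentages are guaranteed to be shares of the same total that is used as the denominator.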

It's noteworthy that only 30 or so people submitted their user data, and that each slice of the pie counts daily records rather than distinct users. Still, the pie chart serves as a useful representation of how FitBit user activity is distributed. To look at the shape of the distribution, a kernel density estimate with a Gaussian kernel was plotted with the following script:

```{r histogramuseractivity, echo = TRUE}
ggplot(dailyActivity, aes(x = total_steps)) + geom_density(kernel = "gaussian", fill = "lightblue", alpha = 0.5) + labs(title = "Density of Daily Step Counts (Gaussian Kernel)", x = "Total Steps", y = "Density") + theme_minimal()
```

From both the pie chart and the density plot, a few insights can be drawn:

-   The highest density (peak of the curve) occurs in the range of low step counts (0-10,000).
-   Around 80% of the logged user-days fall into the Sedentary or Sedentary-Active categories, with only a small minority in the Athletic category.
-   The right skew of the curve (a long tail towards high step counts) shows that athletic activity levels are not significantly represented in the dataset.
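The direction of a skew can also be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the second central moment to the power 1.5) on made-up step counts; `skewness()` is defined here by hand and is not a function used in the report. A positive value, together with a mean above the median, indicates a tail towards high values.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up step counts: many low days and a few very high ones
steps_toy <- c(1000, 2000, 2500, 3000, 3500, 4000, 6000, 12000, 25000)

skewness(steps_toy)   # positive: long tail towards high step counts
mean(steps_toy)       # pulled upwards by the few high days
median(steps_toy)     # stays with the bulk of the data
```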

## Correlation Between Variables {#correlation-between-variables}

Trying to connect with the Sleep dataframe, a relationship was sought between daily calories (aggregated from the Hourly Calories data) and the amount of sleep users logged.

```{r sleepcals, echo = TRUE}
dailySleep <- minuteSleep %>%
  mutate(date = as.Date(date)) %>%
  group_by(id, date) %>%
  summarise(total_minutes_slept = sum(value, na.rm = TRUE)) %>%
  ungroup()
dailyCal <- hourlyCalories %>%
  mutate(activity_hour = as.Date(activity_hour)) %>%
  rename(date = activity_hour) %>%
  group_by(id, date) %>%
  summarise(calories = sum(calories, na.rm = TRUE)) %>%
  ungroup()
dailySleepCal <- dailySleep %>%
  inner_join(dailyCal, by = c("id", "date")) %>%
  arrange(date)
cor(dailySleepCal$calories, dailySleepCal$total_minutes_slept)
ggplot(dailySleepCal, aes(x=calories, y = total_minutes_slept)) + geom_point() + theme_minimal() + labs(title = "Total Daily Calories vs Total Minutes Slept", x = "Daily Calories", y = "Total Minutes Slept")
ggplot(dailySleepCal, aes(x= calories, y=total_minutes_slept)) + geom_jitter() + geom_smooth(color = "red")+ labs(title = "Smoothed Total Daily Calories vs Total Minutes Slept", x = "Daily Calories", y = "Total Minutes Slept") + theme_minimal()
```

At low calorie values, total minutes slept rise slightly as daily calories burned increase. The smoothed trend then flattens and dips slightly before rising again at the highest calorie values. A steeper trend might have been expected, but the sample is small. The cor() function yields -0.15, so the linear relationship between daily calories burned and total minutes slept appears to be very weak. Looking for a different pair of variables, the minuteMET (metabolic equivalent) and minuteSleep sets were chosen next.

```{r metsleep, echo = TRUE}
dailyMET <- minuteMET %>%
  mutate(activity_minute = as.Date(activity_minute)) %>%
  rename(date = activity_minute) %>%
  group_by(id, date) %>%
  summarise(me_ts= sum(me_ts, na.rm = TRUE)) %>%
  ungroup()
dailySleepMET <- dailySleep %>%
  inner_join(dailyMET, by = c("id", "date")) %>%
  arrange(date)
cor(dailySleepMET$me_ts, dailySleepMET$total_minutes_slept)
ggplot(dailySleepMET, aes(x=me_ts, y = total_minutes_slept)) + geom_point() + theme_minimal() + labs(title = "Total Daily METs vs Total Minutes Slept", x = "METs", y = "Total Minutes Slept")
ggplot(dailySleepMET, aes(x= me_ts, y=total_minutes_slept)) + geom_jitter() + geom_smooth(color = "red")+ labs(title = "Smoothed Total Daily METs vs Total Minutes Slept", x = "METs", y = "Total Minutes Slept") + theme_minimal()
```

Once again, the correlation coefficient was very weak (-0.09). Therefore, the variable pair \<steps, sleep\> was tried next.

```{r stepssleep, echo = TRUE}
dailySteps <- hourlySteps %>%
  mutate(activity_hour = as.Date(activity_hour)) %>%
  rename(date = activity_hour) %>%
  group_by(id, date) %>%
  summarise(step_total= sum(step_total, na.rm = TRUE)) %>%
  ungroup()
dailySleepSteps <- dailySleep %>%
  inner_join(dailySteps, by = c("id", "date")) %>%
  arrange(date)
cor(dailySleepSteps$step_total, dailySleepSteps$total_minutes_slept)
ggplot(dailySleepSteps, aes(x=step_total, y = total_minutes_slept)) + geom_point() + theme_minimal() + labs(title = "Total Daily Steps vs Total Minutes Slept", x = "Steps", y = "Total Minutes Slept")
ggplot(dailySleepSteps, aes(x= step_total, y=total_minutes_slept)) + geom_jitter() + geom_smooth(color = "red")+ labs(title = "Smoothed Total Daily Steps vs Total Minutes Slept", x = "Steps", y = "Total Minutes Slept") + theme_minimal()
```

Again, the correlation coefficient was very weak (-0.09). As a last attempt, the logged distance was matched with the sleep set.

```{r distancessleep, echo = TRUE}
dailyDistance <- dailyActivity %>%
  mutate(activity_date = as.Date(activity_date)) %>%
  rename(date = activity_date) %>%
  group_by(id, date) %>%
  summarise(total_distance= sum(total_distance, na.rm = TRUE)) %>%
  ungroup()
dailySleepDistance <- dailySleep %>%
  inner_join(dailyDistance, by = c("id", "date")) %>%
  arrange(date)
cor(dailySleepDistance$total_distance, dailySleepDistance$total_minutes_slept)
ggplot(dailySleepDistance, aes(x=total_distance, y = total_minutes_slept)) + geom_point() + theme_minimal() + labs(title = "Total Distance vs Total Minutes Slept", x = "Distance", y = "Total Minutes Slept")
ggplot(dailySleepDistance, aes(x= total_distance, y=total_minutes_slept)) + geom_jitter() + geom_smooth(color = "red")+ labs(title = "Smoothed Total Daily Distance vs Total Minutes Slept", x = "Distance", y = "Total Minutes Slept") + theme_minimal()
```

Once again, the correlation turned out weak (-0.11). One possible explanation is that external factors influence what would otherwise be an intuitive relationship between the variables. In addition, the sleep tracker was not used consistently, which could be one reason why the data does not show as strong a connection as one might expect. Finally, most people only have a set amount of time available for sleeping because of other day-to-day activities, which limits how much sleep can vary regardless of other variables.
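One further point that may be worth double-checking is how `total_minutes_slept` is built, since it is the sum of the `value` column of the minute-level sleep data. The sketch below rests on an assumption that should be verified against the dataset's data dictionary: that `value` is a sleep-state code (1 = asleep, 2 = restless, 3 = awake) rather than a duration. If that holds, summing the codes and counting the asleep minutes give different totals, as the made-up rows show.

```r
# Made-up minute-level records; `value` is ASSUMED to be a state code
# (1 = asleep, 2 = restless, 3 = awake), one row per minute
sleep_toy <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2),
  date  = "2016-04-12",
  value = c(1, 1, 2, 3, 1, 1)
)

# Summing the codes, as in the dailySleep step
by_sum <- aggregate(value ~ id + date, data = sleep_toy, FUN = sum)

# Counting only the minutes coded as asleep
asleep <- aggregate(value ~ id + date, data = sleep_toy,
                    FUN = function(v) sum(v == 1))

by_sum
asleep
```

In the toy data both users have two asleep minutes, yet the summed codes differ (7 versus 2), so under this assumption a restless night would appear as more "minutes slept".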

Turning to other datasets to find a pair of variables that does show a correlation, Calories and Steps were tried.

```{r stepscal, echo = TRUE}
dailyCalSteps <- dailyCal %>%
  inner_join(dailySteps, by = c("id","date")) %>%
  arrange(date)
cor(dailyCalSteps$step_total, dailyCalSteps$calories)
ggplot(dailyCalSteps, aes(x=step_total, y = calories)) + geom_point() + theme_minimal() + labs(title = "Total Daily Steps vs Total Calories Spent", x = "Total Steps", y = "Total Calories Spent")
ggplot(dailyCalSteps, aes(x= step_total, y=calories)) + geom_jitter() + geom_smooth(color = "red")+ labs(title = "Smoothed Total Daily Steps vs Total Calories Spent", x = "Total Steps", y = "Total Calories Spent") + theme_minimal()
```

There is a clear positive correlation between total daily steps and total calories spent (0.62; it could be stronger, but it's a start). More steps generally go with higher calorie expenditure. The variability in calories spent for a given number of steps, especially at lower step counts, suggests other factors influencing calorie expenditure, like age, gender, geographical environment, nourishment, particular conditions, etc.
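To make the coefficient more tangible, the sketch below computes Pearson's r by hand on made-up step and calorie values, compares it with `cor()`, and squares the reported 0.62. For a straight-line fit with one predictor, r squared is the share of variance explained, so 0.62 corresponds to roughly 38%.

```r
# Made-up daily values, for illustration only
steps    <- c(2000, 4000, 6000, 8000, 10000, 12000)
calories <- c(1700, 1850, 1800, 2100, 2050, 2400)

# Built-in Pearson correlation
r_builtin <- cor(steps, calories)

# Same quantity from its definition: covariance over product of spreads
dx <- steps - mean(steps)
dy <- calories - mean(calories)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Share of variance explained by a straight line for the reported r = 0.62
r_squared_report <- 0.62^2

c(builtin = r_builtin, manual = r_manual)
r_squared_report  # 0.3844
```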

## Consistency in Use of Smart-Device {#consistency-in-use-of-smart-device}

Moving on, to find out how consistently the smart device was used, the following script was run:

```{r daysused, echo=TRUE}
daysUsed <- dailyActivity %>%
  group_by(id) %>%
  summarize(unique_dates = n_distinct(activity_date))
ggplot(daysUsed, aes(x = factor(id), y = unique_dates, fill = unique_dates)) + geom_col() + scale_fill_gradient(low = "lightblue", high = "darkblue") + labs(title = "Histogram of Unique Dates Logged per ID", x = "ID", y = "Days Logged", fill = "Unique Dates") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
mean(daysUsed$unique_dates)
n_distinct(dailyActivity$activity_date)
max(daysUsed$unique_dates)
min(daysUsed$unique_dates)
commitmentDates <- mean(daysUsed$unique_dates) / n_distinct(dailyActivity$activity_date) * 100
```

From both the graph and the calculations, it is apparent that, of the 62 different dates in the cleaned dataframes, each user ID logged data for 39.2 days on average (roughly 39). The minimum number of days logged was 8, and the maximum was the full 62. Therefore, compared against the total number of dates, users were on average 63.27% committed to reporting data with their smart tracker.
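The commitment figure is the mean number of days logged per user divided by the number of distinct dates in the data. A minimal base R sketch of that arithmetic on a made-up log (two users, four dates) is shown below; the values are illustrative and unrelated to the FitBit data.

```r
# Made-up activity log: user 1 logs all four days, user 2 only two of them
log_toy <- data.frame(
  id = c(1, 1, 1, 1, 2, 2),
  activity_date = as.Date(c("2016-04-01", "2016-04-02", "2016-04-03",
                            "2016-04-04", "2016-04-01", "2016-04-03"))
)

# Distinct dates logged by each user
days_per_user <- tapply(log_toy$activity_date, log_toy$id,
                        function(d) length(unique(d)))

# Distinct dates in the whole log
total_dates <- length(unique(log_toy$activity_date))

# Average share of available dates on which a user logged data
commitment <- mean(days_per_user) / total_dates * 100
commitment  # (4 + 2) / 2 = 3 days on average, out of 4 dates
```

Since this is an average across users, a few users who log every day can offset several who rarely log, so the minimum and maximum are worth reporting alongside it.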

# Act Phase {#act-phase}

*Analysis Summary*

| Key Words | Insight |
|--------------------|-----------------------------------------|
| 1. Trending Data Features | Products or campaigns that include tracking of Daily Activity, Calorie Consumption, Activity Intensity, and METs are likely to resonate with Sedentary or Sedentary-Active users, the categories most FitBit users in the sample fall into. |
| 2. Low-Engagement Features | Heart rate, sleep monitoring and weight tracking features could appeal to a different demographic that is not being covered by the FitBit devices, likely the Active or Athletic categories. Increasing awareness of benefits obtained by monitoring these health features, maybe also including fat and blood oxygen logging, might increase interest. |
| 3. Sleep Tracking | Moderate engagement in sleep data suggests potential for growth. |
| 4. Activity-based Audience Targeting | 80% of the reported users with FitBit fall into the sedentary and sedentary-active ranges, with a minority in the Athletic range. |
| 5. Connected Variables | Minutes slept has a weak correlation with most of the variables, likely due to external factors being present, but steps and calorie burning have a medium-strong relationship. |
| 6. Usage Consistency | Users were 63.27% committed in reporting daily activity data with their smart tracker. |


*Recommendations*

General recommendations regarding tactics and strategies that Bellabeat could follow include:

* Short term: marketing and educational campaigns about the benefits of tracking health and activity, as well as targeting current device features and expanding the existing user base.
* Medium term: product updates and product development focused on new features.
* Long term: expansion into new demographics and markets, as well as collaboration with healthcare providers.

Now, relating insights towards particular recommendations:

1. Focus on Trending Data Features: Bellabeat can capitalize on the growing interest in activity tracking by improving its offerings for basic daily activity monitoring, such as calorie consumption, steps taken, activity intensity, and METs. These features strongly resonate with sedentary and sedentary-active users, who represent the majority among casual smart-device users such as those of FitBit, the source of the data. To appeal to this group, Bellabeat should develop a user-friendly interface that visually presents this data and provides personalized activity goals. Gamification, such as progress badges or challenges, could further motivate users to engage consistently with these features. This would not only improve user satisfaction but also position Bellabeat as a leader in the wellness tracking industry for sedentary users.

2. Enhance Low-Engaged Features: While core features like heart rate monitoring, weight tracking, and sleep data are available, these currently lack high engagement, likely from the complementary demographic opposed to sedentary ones, being more interested in the deeper levels of fitness tracking. To address this, Bellabeat could introduce advanced health-tracking features, such as blood oxygen monitoring and body fat percentage logging, combined with educational tools explaining the benefits of heart rate monitoring, sleep tracking and also weight logging. Partnering with health experts to provide users with actionable insights into their data could also boost interest. Developing a premium subscription tier that includes in-depth analytics or virtual health consultations could further incentivize users to explore these lesser-used features.

3. Promote Sleep Tracking: Sleep tracking presents a growth opportunity for Bellabeat, as moderate engagement suggests room for improvement. The company can differentiate itself by offering personalized insights and actionable recommendations to help users improve sleep quality. Features like guided bedtime routines, sleep coaching, or notifications about optimal sleep windows could increase user interaction and interest in said metric. Bellabeat might also consider integrating a “Sleep Score” feature or product that provides a comprehensive view of users’ sleep health and offers tailored advice for improvement. Enhancing this service would position Bellabeat as a comprehensive health and wellness platform with a diverse product portfolio and opportunities for users.

4. Target Activity-Based Audiences: Bellabeat should consider segmenting its user base by activity levels—sedentary, sedentary-active, and athletic—and tailoring its marketing and product features accordingly. For sedentary users, campaigns could emphasize ease of use and gradual habit building, while for more active users, Bellabeat could highlight advanced tracking features and progress analytics. To attract athletic users, Bellabeat could introduce new capabilities, such as VO2 max tracking or interval training analytics, along with targeted partnerships with sports organizations or fitness clubs, as well as deepening their focus on particular insights in health supervision that is often overlooked, like heart monitoring. By broadening its demographic appeal, Bellabeat could significantly expand its market share.

5. Leverage Connected Variables: Bellabeat has the opportunity to use insights about the medium-strong correlation between steps and calorie burning to create dynamic, activity-based programs. For example, personalized fitness plans could be generated to help users achieve specific calorie-burning goals, paired with real-time feedback on their progress. A machine learning feature that tracks and predicts calorie expenditure based on activity patterns could also enhance user engagement. Such developments would deepen user trust in the product’s ability to provide actionable insights, reinforcing Bellabeat’s value as a reliable fitness and wellness companion.

6. Reward Usage Consistency: Consistency is key to extracting value from wellness trackers, and the FitBit data indicates that many users are already fairly committed to tracking daily activity. To encourage even greater consistency, Bellabeat could introduce streak-based rewards or recognition systems that incentivize regular engagement. Monthly progress reports showcasing achievements and offering tips for continued improvement could also be effective. Additionally, integrating reminders for users who miss logging activity could further reduce attrition. By fostering routine usage, Bellabeat can build long-term user loyalty and increase the lifetime value of its customers.

To strengthen its relationship with users and encourage long-term engagement, Bellabeat should explore product bundling and subscription models. For instance, smart trackers could be paired with access to exclusive wellness content, such as guided meditation, fitness routines, or stress management tools. Subscriptions could also include premium features like one-on-one virtual coaching sessions, advanced health insights, or exclusive health challenges. This approach not only provides additional value to users but also generates a steady revenue stream, allowing Bellabeat to continually innovate and expand its offerings.

To remain competitive, Bellabeat should consider diversifying its product offerings to serve a broader audience, particularly more active and athletic users who are currently underserved. This could involve developing a new product line tailored to high-performance fitness tracking, featuring metrics such as advanced heart rate variability analysis, workout recovery suggestions, or sport-specific analytics. Additionally, Bellabeat could launch a subscription-based wellness service offering features like personalized coaching, guided workouts, or tailored nutrition plans. By combining innovative hardware with value-added services, Bellabeat could position itself as an all-in-one solution for health and fitness enthusiasts.

By implementing these strategies and introducing new services or products, Bellabeat can broaden its appeal, strengthen user engagement, and establish itself as a leader in the wellness tech industry. Nevertheless, it is paramount to keep analyzing up-to-date and diversified data in order to make the appropriate data-driven decisions.


# References {#references}

Psico-Smart Editorial Team. (2024). The Impact of Wearable Technology on Personal Health Monitoring. <https://psico-smart.com/en/blogs/blog-the-impact-of-wearable-technology-on-personal-health-monitoring-162704>

National Heart, Lung and Blood Institute. (2023). Study reveals wearable device trends among US adults. NIH. <https://www.nhlbi.nih.gov/news/2023/study-reveals-wearable-device-trends-among-us-adults>

Kaggle. (2024). FitBit Fitness Tracker Data. [dataset]. <https://www.kaggle.com/datasets/arashnic/fitbit>