Commit d2469e9

differences for PR #149
1 parent 3ba135d commit d2469e9

3 files changed

Lines changed: 136 additions & 2 deletions

.Rhistory

Whitespace-only changes.

04-data-structures-part2.md

Lines changed: 135 additions & 1 deletion
@@ -99,7 +99,8 @@ str(gapminder)
9999
$ gdpPercap: num 779 821 853 836 740 ...
100100
```
101101

102-
We can also examine individual columns of the data frame with our `class` function:
102+
We can also examine individual columns of the data frame with the `class` or
103+
`typeof` functions:
103104

104105

105106
``` r
@@ -110,6 +111,14 @@ class(gapminder$year)
110111
[1] "integer"
111112
```
112113

114+
``` r
115+
typeof(gapminder$year)
116+
```
117+
118+
``` output
119+
[1] "integer"
120+
```
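
For a column like `year`, `class` and `typeof` return the same answer, but they report different things: `class` describes how R treats the object, while `typeof` describes how it is stored internally. A minimal sketch of the difference, assuming a small made-up data frame (`toy` and its values are hypothetical, not part of gapminder):

```r
# Hypothetical toy data frame, not the gapminder dataset
toy <- data.frame(
  year      = c(1952L, 1957L),
  lifeExp   = c(28.801, 30.332),
  continent = factor(c("Asia", "Asia"))
)

class(toy$year)        # "integer"
typeof(toy$year)       # "integer"

class(toy$lifeExp)     # "numeric"
typeof(toy$lifeExp)    # "double"

class(toy$continent)   # "factor"
typeof(toy$continent)  # "integer": factors are stored as integer codes
```

The two functions only disagree when a class is layered on top of a basic storage type, as with doubles and factors here.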
121+
113122
``` r
114123
class(gapminder$country)
115124
```
@@ -400,6 +409,131 @@ tail(gapminder_norway)
400409
```
401410

402411

412+
413+
## Removing columns and rows in data frames
414+
415+
To remove columns from a data frame, we can use the `subset` function.
416+
This function allows us to remove columns using their names.
417+
If we want to keep all columns except `continent`, `pop`, and `gdpPercap`, we can use the following `subset` command:
418+
419+
420+
``` r
421+
life_expectancy <- subset(gapminder, select = -c(continent, pop, gdpPercap))
422+
head(life_expectancy)
423+
```
424+
425+
``` output
426+
country year lifeExp below_average
427+
1 Afghanistan 1952 28.801 TRUE
428+
2 Afghanistan 1957 30.332 TRUE
429+
3 Afghanistan 1962 31.997 TRUE
430+
4 Afghanistan 1967 34.020 TRUE
431+
5 Afghanistan 1972 36.088 TRUE
432+
6 Afghanistan 1977 38.438 TRUE
433+
```
434+
435+
We can also use a logical vector to achieve the same result. Make sure the
436+
vector's length matches the number of columns in the data frame (to avoid R
437+
repeating the shorter vector to match the length of the longer vector, called
438+
"vector recycling"):
439+
440+
441+
``` r
442+
life_expectancy <- gapminder[c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)]
443+
head(life_expectancy)
444+
```
445+
446+
``` output
447+
country year lifeExp below_average
448+
1 Afghanistan 1952 28.801 TRUE
449+
2 Afghanistan 1957 30.332 TRUE
450+
3 Afghanistan 1962 31.997 TRUE
451+
4 Afghanistan 1967 34.020 TRUE
452+
5 Afghanistan 1972 36.088 TRUE
453+
6 Afghanistan 1977 38.438 TRUE
454+
```
455+
456+
:::::: spoiler
457+
458+
### Vector Recycling
459+
460+
Vector recycling occurs when working with vectors of different lengths: R
461+
repeats the elements of the shorter vector until it matches the length of
462+
the longer one. For more information, check the book R for Data Science and its
463+
[chapter about vectors](https://r4ds.had.co.nz/vectors.html#scalars-and-recycling-rules).
464+
::::::::
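
To see how recycling can silently change which columns are kept, here is a minimal sketch on a made-up four-column data frame (`toy` and its column names are hypothetical). A logical vector that is shorter than the number of columns is repeated, without an error, until it covers all of them:

```r
# Hypothetical toy data frame with four columns
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# Length 4 matches the 4 columns: keeps exactly "a" and "d"
full_length <- toy[c(TRUE, FALSE, FALSE, TRUE)]
names(full_length)   # "a" "d"

# Length 2 is recycled to TRUE, FALSE, TRUE, FALSE: keeps "a" and "c"
recycled <- toy[c(TRUE, FALSE)]
names(recycled)      # "a" "c"
```

Writing one `TRUE` or `FALSE` per column makes the selection explicit and avoids depending on this behaviour.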
465+
466+
Alternatively, we can use column positions:
467+
468+
469+
``` r
470+
life_expectancy <- gapminder[-c(3, 4, 6)]
471+
head(life_expectancy)
472+
```
473+
474+
``` output
475+
country year lifeExp below_average
476+
1 Afghanistan 1952 28.801 TRUE
477+
2 Afghanistan 1957 30.332 TRUE
478+
3 Afghanistan 1962 31.997 TRUE
479+
4 Afghanistan 1967 34.020 TRUE
480+
5 Afghanistan 1972 36.088 TRUE
481+
6 Afghanistan 1977 38.438 TRUE
482+
```
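
The three approaches shown so far (names with `subset`, a logical vector, and negative positions) select the same columns. A small sketch that checks this on a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical toy data frame; "pop" is the third column
toy <- data.frame(
  country = c("A", "B"),
  year    = c(1952L, 1957L),
  pop     = c(10, 20),
  lifeExp = c(30.1, 31.2)
)

by_name     <- subset(toy, select = -c(pop))     # drop by column name
by_logical  <- toy[c(TRUE, TRUE, FALSE, TRUE)]   # one value per column
by_position <- toy[-3]                           # drop the third column

names(by_name)                                   # "country" "year" "lifeExp"
identical(names(by_name), names(by_logical))     # TRUE
identical(names(by_logical), names(by_position)) # TRUE
```

Selecting by name is usually the most robust choice, because positions and logical vectors stop matching if columns are added or reordered.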
483+
484+
Note that typically we select the rows we want to keep, rather than removing rows we do not want in the data.
485+
However, to remove rows from a data frame, we can use their positions.
486+
To practice on a smaller subset, we will filter the data to only those entries from Afghanistan after the year 2000.
487+
This smaller dataset will be easier for us to inspect by eye and see the changes we are making.
488+
489+
490+
``` r
491+
# Filter data for Afghanistan after the year 2000:
492+
afghanistan_20c <- gapminder[gapminder$country == "Afghanistan" &
493+
gapminder$year > 2000, ]
494+
495+
# Now remove data for 2002, that is, the first row:
496+
afghanistan_20c[-1, ]
497+
```
498+
499+
``` output
500+
country year pop continent lifeExp gdpPercap below_average
501+
12 Afghanistan 2007 31889923 Asia 43.828 974.5803 TRUE
502+
```
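
Removing a row by position returns a new data frame and leaves the original untouched unless the result is assigned. A minimal sketch, assuming a made-up two-row data frame (`toy` and its values are hypothetical), that also shows the same row being dropped with a condition instead of a position:

```r
# Hypothetical toy data frame with two rows
toy <- data.frame(year = c(2002L, 2007L), pop = c(100, 200))

# Drop the first row by position
dropped_by_position <- toy[-1, ]

# Drop the same row by testing the data itself
dropped_by_condition <- toy[toy$year != 2002, ]

dropped_by_position$year    # 2007
dropped_by_condition$year   # 2007
nrow(toy)                   # 2: the original data frame is unchanged
```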
503+
504+
505+
In research, we often remove rows based on features of the data itself, rather than its location.
506+
For example, you may want to remove all the missing data prior to an analysis. Let's first add some missing values (NAs) into the data and then we can use `na.omit()` to remove them.
507+
508+
509+
``` r
510+
# Turn some values into NAs:
511+
afghanistan_20c <- gapminder[gapminder$country == "Afghanistan", ]
512+
afghanistan_20c[afghanistan_20c$year < 2007, "year"] <- NA
513+
head(afghanistan_20c)
514+
```
515+
516+
``` output
517+
country year pop continent lifeExp gdpPercap below_average
518+
1 Afghanistan NA 8425333 Asia 28.801 779.4453 TRUE
519+
2 Afghanistan NA 9240934 Asia 30.332 820.8530 TRUE
520+
3 Afghanistan NA 10267083 Asia 31.997 853.1007 TRUE
521+
4 Afghanistan NA 11537966 Asia 34.020 836.1971 TRUE
522+
5 Afghanistan NA 13079460 Asia 36.088 739.9811 TRUE
523+
6 Afghanistan NA 14880372 Asia 38.438 786.1134 TRUE
524+
```
525+
526+
``` r
527+
# Remove NAs
528+
na.omit(afghanistan_20c)
529+
```
530+
531+
``` output
532+
country year pop continent lifeExp gdpPercap below_average
533+
12 Afghanistan 2007 31889923 Asia 43.828 974.5803 TRUE
534+
```
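
Note that `na.omit()` drops a row if any of its columns contains an `NA`, not only the column we modified. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical), compared with filtering on a single column using `is.na()`:

```r
# Hypothetical toy data frame with NAs in two different columns
toy <- data.frame(
  year = c(2002L, NA, 2007L),
  pop  = c(100, 200, NA)
)

# Drops every row that has an NA in any column (rows 2 and 3)
all_complete <- na.omit(toy)

# Drops only rows where "year" is missing (row 2)
year_known <- toy[!is.na(toy$year), ]

nrow(all_complete)  # 1
nrow(year_known)    # 2
```

Which variant is appropriate depends on whether missing values in the other columns matter for the analysis.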
535+
536+
403537
## Factors
404538

405539
Here is another thing to look out for: in a `factor`, each different value

md5sum.txt

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
66
"episodes/01-rstudio-intro.Rmd" "f4e11815e378019213cd8bc32bd5d292" "site/built/01-rstudio-intro.md" "2025-11-18"
77
"episodes/02-project-intro.Rmd" "d833fc665b635d29bf03d58bb34524a9" "site/built/02-project-intro.md" "2025-11-18"
88
"episodes/03-data-structures-part1.Rmd" "0165027d9aa46f1f0c499f4c9daa8266" "site/built/03-data-structures-part1.md" "2025-11-18"
9-
"episodes/04-data-structures-part2.Rmd" "1cdde317409584348e41142273f08428" "site/built/04-data-structures-part2.md" "2025-11-18"
9+
"episodes/04-data-structures-part2.Rmd" "1a9302c3c8f796b5536c19bb35844ea8" "site/built/04-data-structures-part2.md" "2026-02-02"
1010
"episodes/05-data-subsetting.Rmd" "b673744f991a865b9996504197cc013e" "site/built/05-data-subsetting.md" "2025-11-18"
1111
"episodes/06-dplyr.Rmd" "df360e575e1df180ef2e9bb3d5255a57" "site/built/06-dplyr.md" "2025-11-18"
1212
"episodes/07-plot-ggplot2.Rmd" "960f92a6bc0a0d859457c530a01841b3" "site/built/07-plot-ggplot2.md" "2025-11-18"
