---
title: "Tidycensus Walkthrough"
author: "Jackson M Luckey"
date: "Last Updated 4/2/2020"
output:
  tufte::tufte_html: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r loadPackages, echo=FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(knitr) # for kable
library(tufte)
library(tidycensus)
```
```{r tufteTheme, echo=FALSE, warning=FALSE, message=FALSE}
theme_tufte <- theme(panel.background = element_rect("#fffff8", "#fffff8"),
                     plot.background = element_rect("#fffff8", "#fffff8"))
```
# Introduction to Tidycensus
The R package Tidycensus provides a Tidyverse-friendly way of working with census data. Tidycensus was created by Kyle Walker, who has provided some documentation [here](https://walkerke.github.io/tidycensus/articles/basic-usage.html). Tidycensus allows "R users to return Census and ACS data as tidyverse-ready data frames, and optionally returns a list-column with feature geometry for many geographies". The "list-column with feature geometry" allows the R user to easily draw maps using the downloaded census data and ggplot2. Dataframes can be downloaded in either a wide or tidy format, and brief descriptions of each variable are available. The package is a wrapper for the Census Bureau's data API, and the API calls it creates can be accessed and run manually with an HTTP client package such as httr. The package supports both American Community Survey and decennial census data. To pull down data with Tidycensus, you need to know the variable or table name, the year, the geographic level, and the survey.
## Identifying Variables and Tables
To identify the variables and tables available from a survey, use the command `tidycensus::load_variables(year, survey)`, where `year` is a year in numeric format (e.g. 2018) and `survey` is a survey given as a character string (e.g. "acs1"). This returns a dataframe with the columns `name`, `label`, and `concept`. To search for variables within the survey, filter the output of `load_variables()` with `filter(stringr::str_detect(column, pattern))`. If you don't know the name of the table you want to work with, filter on the `concept` column as shown below:
```{r searchVariablesByConcept, echo=TRUE}
tidycensus::load_variables(2018, "acs1") %>%
  mutate(concept = stringr::str_to_lower(concept)) %>% # makes searching case insensitive
  filter(stringr::str_detect(concept, "education")) %>%
  slice(1:10) %>%
  kable(caption = "First 10 Variables Related to Education")
```
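An alternative to lower-casing the column is to pass a case-insensitive pattern with `stringr::regex()`, which leaves the original `concept` text intact. The sketch below runs offline on a small hypothetical stand-in for the output of `load_variables()`; the rows in `toy_vars` are made up for illustration and are not real census variables.

```{r}
library(dplyr)
library(stringr)

# Hypothetical stand-in for load_variables() output (made-up rows)
toy_vars <- tibble(
  name    = c("X00001_001", "X00001_002", "X00002_001"),
  label   = c("Estimate!!Total", "Estimate!!Total!!Group A", "Estimate!!Total"),
  concept = c("EDUCATIONAL ATTAINMENT", "EDUCATIONAL ATTAINMENT", "MEDIAN AGE")
)

# Case-insensitive match that does not modify the concept column
education_vars <- toy_vars %>%
  filter(str_detect(concept, regex("education", ignore_case = TRUE)))

education_vars
```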
If you already know the name of the table you're interested in, filter on the `name` column and set the pattern to the table name followed by an underscore. For example, the American Community Survey one-year estimates store disability data in the table `B18101`. To look at the variables in that table, filter the results of `tidycensus::load_variables()` as shown below:
```{r}
tidycensus::load_variables(2018, "acs1") %>%
  filter(stringr::str_detect(name, "B18101_")) %>% # keep only rows from table B18101
  kable(caption = "Disability Variables in Table B18101")
```
For American Community Survey data, `tidycensus::load_variables()` lists half as many entries as the number of data columns that the actual census query returns in wide format. This is because each variable comes with both an estimate and a margin of error.
The geographical level, survey, and year are all easier to determine. Geographical levels are things like state, county, school district, census tract, and congressional district. They are passed as character strings. Not all geographical levels are supported by Tidycensus, and not all surveys are available for all geographies. In general, surveys that are collected less often, or that pool more years of responses, are more likely to include smaller geographical levels. Surveys include the different American Community Surveys, the decennial census, and supplemental estimates. They can be found on the U.S. Census Bureau's website. Finally, year is simply a valid year for that survey provided as a number. Remember that variables can change between survey years.
## Pulling Down Data
To pull down actual data from the Census Bureau's API using Tidycensus, use one of three functions depending on the survey that you are working with. To work with the decennial census, use `get_decennial()`. To work with American Community Survey data, use `get_acs()`. Finally, use `get_estimates()` to work with the Census Bureau's population estimates API. If you download the data in wide format, Tidycensus returns a table with the columns `GEOID` and `NAME`, which refer to the geographical region, and two columns, `Table_Variable + E` and `Table_Variable + M`, for each variable in the table. For example, if the table was "someTable" and it only included the variables "001" and "002", the returned dataframe would include the columns `someTable_001E, someTable_001M, someTable_002E, someTable_002M`. E refers to the estimate, while M refers to the margin of error. Finally, if you included the argument `geometry = TRUE` in the call, there will be a column `geometry` that includes the data required to draw a map of the geographical regions included in the call.
```{r}
disability <- tidycensus::get_acs(geography = "state",
                                  table = "B18101",
                                  year = 2018,
                                  output = "wide",
                                  survey = "acs1")
```
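To make the wide-format naming concrete, the following sketch reshapes a toy wide table into a long layout with `tidyr::pivot_longer()`. The data frame `wide_toy` and its values are hypothetical, built from the "someTable" example above, and the column names `variable`, `estimate`, and `moe` are chosen to resemble the tidy output. This illustrates the relationship between the two layouts; it is not a description of how Tidycensus works internally.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical wide-format table with made-up values
wide_toy <- tibble(
  GEOID = c("01", "02"),
  NAME = c("State A", "State B"),
  someTable_001E = c(100, 200),
  someTable_001M = c(10, 20),
  someTable_002E = c(40, 50),
  someTable_002M = c(4, 5)
)

# Split each column name into the variable code and its E/M suffix;
# ".value" turns the suffix into separate E and M value columns
tidy_toy <- wide_toy %>%
  pivot_longer(
    cols = -c(GEOID, NAME),
    names_to = c("variable", ".value"),
    names_pattern = "(.+)([EM])$"
  ) %>%
  rename(estimate = E, moe = M)

tidy_toy
```

Each region now has one row per variable, with the estimate and margin of error side by side.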
To drop the margins of error, you can subset the columns of the `disability` dataframe using `dplyr::select()` and the `tidyselect` helper `ends_with()`, since all margin of error columns end with "M".
```{r}
disability <- disability %>%
  select(-ends_with("M"))
```
Finally, we'll need to convert the columns into more meaningful data. The `B18101` census table provides raw counts of the number of people with disabilities broken down by biological sex[^1] and age. In order to compare states, we'll want to divide the number of people with a disability by the total number of people. Let's start by examining females aged 35 to 64. As shown in the table above, the total count of females aged 35 to 64 is stored in the column `B18101_031E`, while the count of those with disabilities is stored in `B18101_032E`.
```{r}
```
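The division described above can be sketched on a toy data frame. The counts in `toy_disability` are made up, the column names assume the three-digit variable codes used by the Census API (`B18101_031E` and `B18101_032E`), and `female_35_64_rate` is a hypothetical name for the new column.

```{r}
library(dplyr)

# Made-up counts standing in for two columns of the wide-format table
toy_disability <- tibble(
  NAME = c("State A", "State B"),
  B18101_031E = c(1000, 2000), # females aged 35 to 64
  B18101_032E = c(120, 300)    # of those, females with a disability
)

# Share of females aged 35 to 64 who have a disability
toy_rates <- toy_disability %>%
  mutate(female_35_64_rate = B18101_032E / B18101_031E)

toy_rates
```

The same `mutate()` call should apply to the real `disability` dataframe, provided the two columns are present under those names.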
[^1]: The census collects data on biological sex and not gender. More information about how the Census Bureau *currently* collects data on age and sex is available [here](https://www.census.gov/topics/population/age-and-sex/about.html). Presumably this policy might change under a different administration. For brevity and to match the census variable labels, these sex classifications will be referred to as "male" and "female" throughout the rest of the tutorial.
## Making Maps
```{r themeMap}
theme_map <- theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank())
```
```{r}
ggplot(disability, aes(fill = )) +
  theme_map +
  theme_tufte
```
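A choropleth needs two things that the stub above leaves open: a geometry column and a variable mapped to `fill`. The sketch below shows the general shape using `ggplot2::geom_sf()`, assuming the sf package is installed. `toy_states` uses two made-up squares and made-up rates in place of real state boundaries, which a call with `geometry = TRUE` would supply; `square()` and `disability_rate` are hypothetical names.

```{r}
library(ggplot2)
library(sf)

# Helper that builds a unit square whose lower-left corner is at (x0, 0)
square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Toy sf data frame: two adjacent squares with made-up rates
toy_states <- st_sf(
  NAME = c("State A", "State B"),
  disability_rate = c(0.11, 0.15),
  geometry = st_sfc(square(0), square(1))
)

# geom_sf() draws the polygons; fill is mapped to the rate
toy_map <- ggplot(toy_states, aes(fill = disability_rate)) +
  geom_sf()

toy_map
```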
## residual trick with regressions goes here