Commit ee9d4ee

source commit: 020fd02
0 parents  commit ee9d4ee

24 files changed

Lines changed: 2753 additions & 0 deletions

00-sql-introduction.md

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
1+
---
2+
title: Introducing Databases and SQL
3+
teaching: 60
4+
exercises: 5
5+
---
6+
7+
::::::::::::::::::::::::::::::::::::::: objectives
8+
9+
- Describe why relational databases are useful.
10+
- Create and populate a database from a text file.
11+
- Define SQLite data types.
12+
13+
::::::::::::::::::::::::::::::::::::::::::::::::::
14+
15+
:::::::::::::::::::::::::::::::::::::::: questions
16+
17+
- What is a relational database and why should I use it?
18+
- What is SQL?
19+
20+
::::::::::::::::::::::::::::::::::::::::::::::::::
21+
22+
### Setup
23+
24+
*Note: this should have been done by participants before the start of the workshop.*
25+
26+
We use [DB Browser for SQLite](https://sqlitebrowser.org/) and the
27+
[Portal Project dataset](https://figshare.com/articles/Portal_Project_Teaching_Database/1314459)
28+
throughout this lesson. See [Setup](../learners/setup.md) for
29+
instructions on how to download the data, and also how to install DB Browser for SQLite.
30+
31+
## Motivation
32+
33+
To start, let's orient ourselves in our project workflow. Previously,
34+
we used Excel and OpenRefine to go from messy, human-created data
35+
to cleaned, computer-readable data. Now we're going to move to the next piece
36+
of the data workflow, using the computer to read in our data, and then
37+
use it for analysis and visualization.
38+
39+
### What is SQL?
40+
41+
SQL stands for Structured Query Language. SQL allows us to interact with relational databases through queries.
42+
Queries let you perform actions such as inserting, selecting, updating, and deleting information in a database.
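
As a rough illustration of those four actions, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `animals` table and its values are made up for this example and are not part of the Portal dataset.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (id INTEGER PRIMARY KEY, name TEXT, weight REAL)")

# INSERT: add new records.
conn.execute("INSERT INTO animals (name, weight) VALUES ('mouse', 20.5)")
conn.execute("INSERT INTO animals (name, weight) VALUES ('rat', 150.0)")

# UPDATE: change an existing record.
conn.execute("UPDATE animals SET weight = 21.0 WHERE name = 'mouse'")

# DELETE: remove a record.
conn.execute("DELETE FROM animals WHERE name = 'rat'")

# SELECT: read records back.
rows = conn.execute("SELECT name, weight FROM animals").fetchall()
print(rows)
```

In this lesson we will mostly use `SELECT`, typed into DB Browser for SQLite rather than run from Python.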
43+
44+
### Dataset Description
45+
46+
The data we will be using is a time-series for a small mammal community in
47+
southern Arizona. This is part of a project studying the effects of rodents and
48+
ants on the plant community that has been running for almost 40 years. The
49+
rodents are sampled on a series of 24 plots, with different experimental
50+
manipulations controlling which rodents are allowed to access which plots.
51+
52+
This is a real dataset that has been used in over 100 publications. We've
53+
simplified it for the workshop, but you can download the
54+
[full dataset](https://esapubs.org/archive/ecol/E090/118/) and work with it using
55+
exactly the same tools we'll learn about today.
56+
57+
### Questions
58+
59+
Let's look at some of the cleaned spreadsheets you downloaded during [Setup](../learners/setup.md) to complete this challenge. You'll need the following three files:
60+
61+
- `surveys.csv`
62+
- `species.csv`
63+
- `plots.csv`
64+
65+
::::::::::::::::::::::::::::::::::::::: challenge
66+
67+
### Challenge
68+
69+
Open each of these csv files and explore them.
70+
What information is contained in each file? Specifically, if I had
71+
the following research questions:
72+
73+
- How has the hindfoot length and weight of *Dipodomys* species changed over time?
74+
- What is the average weight of each species, per year?
75+
- What information can I learn about *Dipodomys* species in the 2000s, over time?
76+
77+
What would I need to answer these questions? Which files have the data I need? What
78+
operations would I need to perform if I were doing these analyses by hand?
79+
80+
81+
::::::::::::::::::::::::::::::::::::::::::::::::::
82+
83+
### Goals
84+
85+
In order to answer the questions described above, we'll need to do the
86+
following basic data operations:
87+
88+
- select subsets of the data (rows and columns)
89+
- group subsets of data
90+
- do math and other calculations
91+
- combine data across spreadsheets
92+
93+
In addition, we don't want to do this manually! Instead of searching
94+
for the right pieces of data ourselves, or clicking between spreadsheets,
95+
or manually sorting columns, we want to make the computer do the work.
96+
97+
In particular, we want to use a tool where it's easy to repeat our analysis
98+
in case our data changes. We also want to do all this searching without
99+
actually modifying our source data.
100+
101+
Putting our data into a relational database and using SQL will help us achieve these goals.
102+
103+
::::::::::::::::::::::::::::::::::::::::: callout
104+
105+
### Definition: *Relational Database*
106+
107+
A relational database stores data in *relations* made up of *records* with *fields*.
108+
The relations are usually represented as *tables*;
109+
each record is usually shown as a row, and the fields as columns.
110+
In most cases, each record will have a unique identifier, called a *key*,
111+
which is stored as one of its fields.
112+
Records may also contain keys that refer to records in other tables,
113+
which enables us to combine information from two or more sources.
114+
115+
116+
::::::::::::::::::::::::::::::::::::::::::::::::::
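
To make the idea of keys more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and field names follow the Portal files, but the rows themselves are illustrative values typed in by hand, not real records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (species_id TEXT PRIMARY KEY, genus TEXT);
CREATE TABLE surveys (record_id INTEGER PRIMARY KEY, species_id TEXT, weight REAL);
INSERT INTO species VALUES ('DM', 'Dipodomys'), ('NL', 'Neotoma');
INSERT INTO surveys VALUES (1, 'NL', 200.0), (2, 'DM', 40.0), (3, 'DM', 48.0);
""")

# Each survey record carries a species_id key that points at a species record,
# so the two tables can be combined on that shared field.
rows = conn.execute("""
    SELECT surveys.record_id, species.genus, surveys.weight
    FROM surveys
    JOIN species ON surveys.species_id = species.species_id
    ORDER BY surveys.record_id
""").fetchall()

for row in rows:
    print(row)
```

The genus name is stored once in `species`, yet it appears next to every matching survey record in the combined result.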
117+
118+
## Databases
119+
120+
### Why use relational databases
121+
122+
Using a relational database serves several purposes.
123+
124+
- It keeps your data separate from your analysis.
125+
  - This means there's no risk of accidentally changing data when you analyze it.
125+
  - If we get new data we can rerun the query.
127+
- It's fast, even for large amounts of data.
128+
- It improves quality control of data entry (type constraints and use of forms in MS Access, FileMaker, Oracle Application Express, etc.).
129+
- The concepts of relational database querying are core to understanding how to do similar things using programming languages such as R or Python.
130+
131+
### Database Management Systems
132+
133+
There are different database management systems to work with relational databases
134+
such as SQLite, MySQL, PostgreSQL, MSSQL Server, and many more. They differ
135+
mainly in their scalability, but they all share the same core principles of
136+
relational databases. In this lesson, we use SQLite to introduce you to SQL and
137+
data retrieval from a relational database.
138+
139+
### Relational databases
140+
141+
Let's look at a pre-existing database, the `portal_mammals.sqlite`
142+
file from the Portal Project dataset that we downloaded during
143+
[Setup](../learners/setup.md). In DB Browser for SQLite, click on the "Open Database" button, select the portal\_mammals.sqlite file, and click "Open" to open the database.
144+
145+
You can see the tables in the database by looking at the left hand side of the
146+
screen under the "Database Structure" tab. Here you will see a list under "Tables." Each item listed here corresponds to one of the `csv` files
147+
we were exploring earlier. To see the contents of any table, right-click on it, and
148+
then click "Browse Table" in the menu, or select the "Browse Data" tab next to the "Database Structure" tab and select the desired table from the dropdown named "Table". This will
149+
give us a view that we're used to - a copy of the table. Hopefully this
150+
helps to show that a database is, in some sense, only a collection of tables,
151+
where there's some value in the tables that allows them to be connected to each
152+
other (the "related" part of "relational database").
153+
154+
The "Database Structure" tab also provides some metadata about each table. If you click on the down arrow next to a table name, you will see information about the columns, which in databases are referred to as "fields," and their assigned data types.
155+
(The rows of a database table
156+
are called *records*.) Each field contains
157+
one variety or type of data, often numbers or text. You can see in the
158+
`surveys` table that most fields contain numbers (BIGINT, or big integer, and FLOAT, or floating point numbers/decimals) while the `species`
159+
table is entirely made up of text fields.
160+
161+
The "Execute SQL" tab is blank now - this is where we'll be typing our queries
162+
to retrieve information from the database tables.
163+
164+
To summarize:
165+
166+
- Relational databases store data in tables with fields (columns) and records
167+
(rows)
168+
- Data in tables has types, and all values in a field have
169+
the same type ([list of data types](#datatypes))
170+
- Queries let us look up data or make calculations based on columns
171+
172+
### Database Design
173+
174+
- Every row-column combination contains a single *atomic* value, i.e., not
175+
containing parts we might want to work with separately.
176+
- One field per type of information
177+
- No redundant information
178+
- Split into separate tables with one table per class of information
179+
- Needs an identifier in common between tables, i.e. a shared column, to
180+
reconnect (known as a *foreign key*).
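
One way to see a foreign key at work is the sketch below, written with Python's built-in `sqlite3` module. It assumes a cut-down `plots` and `surveys` schema invented for this example. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set, which is an extra step the import procedure in this lesson does not require.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE plots (plot_id INTEGER PRIMARY KEY, plot_type TEXT)")
conn.execute("""
    CREATE TABLE surveys (
        record_id INTEGER PRIMARY KEY,
        plot_id INTEGER REFERENCES plots (plot_id)
    )
""")

conn.execute("INSERT INTO plots VALUES (1, 'Control')")
conn.execute("INSERT INTO surveys VALUES (1, 1)")  # plot 1 exists, so this is accepted

try:
    conn.execute("INSERT INTO surveys VALUES (2, 99)")  # there is no plot 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```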
181+
182+
### Import
183+
184+
Before we get started with writing our own queries, we'll create our own
185+
database. We'll be creating this database from the three `csv` files
186+
we downloaded earlier. Close the currently open database (**File > Close Database**) and then
187+
follow these instructions:
188+
189+
1. Start a New Database
190+
   - Click the **New Database** button
190+
   - Give the database a name and click **Save** to create it in the opened folder
191+
   - In the "Edit table definition" window that pops up, click **Cancel**, as we will be importing tables, not creating them from scratch
193+
2. Select **File > Import > Table from CSV file...**
194+
3. Choose `surveys.csv` from the data folder we downloaded and click **Open**.
195+
4. Give the table a name that matches the file name (`surveys`), or use the default
196+
5. If the first row has column headings, be sure to check the box next to "Column names in first line".
197+
6. Be sure the field separator and quotation options are correct. If you're not sure which options are correct, test some of the options until the preview at the bottom of the window looks right.
198+
7. Press **OK**. You should then get a message that the table was imported.
199+
8. Back on the "Database Structure" tab, you should now see the table listed. Right-click on the table name and choose **Modify Table**, or click on the **Modify Table** button just under the tabs and above the table list.
200+
9. Click **Save** if asked to save all pending changes.
201+
10. In the center panel of the window that appears, set the data types for each field using the suggestions in the table below (this includes fields from the `plots` and `species` tables also).
202+
11. Finally, click **OK** one more time to confirm the operation. Then click the **Write Changes** button to save the database.
203+
204+
| Field | Data Type | Motivation | Table(s) |
205+
| ----------------------------------------------------- | :----------------------- | ----------------------------------------------------------------------- | ---------------- |
206+
| day | INTEGER | Having data as numeric allows for meaningful arithmetic and comparisons | surveys |
207+
| genus | TEXT | Field contains text data | species |
208+
| hindfoot\_length | REAL | Field contains measured numeric data | surveys |
209+
| month | INTEGER | Having data as numeric allows for meaningful arithmetic and comparisons | surveys |
210+
| plot\_id | INTEGER | Field contains numeric data | plots, surveys |
211+
| plot\_type | TEXT | Field contains text data | plots |
212+
| record\_id | INTEGER | Field contains numeric data | surveys |
213+
| sex | TEXT | Field contains text data | surveys |
214+
| species\_id | TEXT | Field contains text data | species, surveys |
215+
| species | TEXT | Field contains text data | species |
216+
| taxa | TEXT | Field contains text data | species |
217+
| weight | REAL | Field contains measured numeric data | surveys |
218+
| year | INTEGER | Allows for meaningful arithmetic and comparisons | surveys |
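
The same import can be sketched in code. The example below uses Python's built-in `csv` and `sqlite3` modules and applies the data types suggested in the table above. To stay self-contained it reads from a small in-memory stand-in for `surveys.csv`; the column order and the two rows are assumptions made for illustration, not values copied from the real file.

```python
import csv
import io
import sqlite3

# Stand-in for surveys.csv; the column order and rows are made up for illustration.
sample_csv = io.StringIO(
    "record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight\n"
    "1,7,16,1977,2,NL,M,32,\n"
    "2,7,16,1977,3,DM,F,37,40\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE surveys (
        record_id INTEGER, month INTEGER, day INTEGER, year INTEGER,
        plot_id INTEGER, species_id TEXT, sex TEXT,
        hindfoot_length REAL, weight REAL
    )
""")

reader = csv.DictReader(sample_csv)  # the first line supplies the column names
for row in reader:
    # Store empty cells as NULL rather than as empty strings.
    values = [value if value != "" else None for value in row.values()]
    conn.execute("INSERT INTO surveys VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", values)
conn.commit()

rows = conn.execute(
    "SELECT record_id, year, hindfoot_length, weight FROM surveys ORDER BY record_id"
).fetchall()
print(rows)
```

Every value arrives from the CSV reader as text. Because the fields were declared as `INTEGER` or `REAL`, SQLite converts the numeric-looking text to numbers when it stores them.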
219+
220+
::::::::::::::::::::::::::::::::::::::: challenge
221+
222+
### Challenge
223+
224+
- Import the `plots` and `species` tables
225+
226+
227+
::::::::::::::::::::::::::::::::::::::::::::::::::
228+
229+
You can also use this same approach to append new fields to an existing table.
230+
231+
### Adding fields to existing tables
232+
233+
1. Go to the "Database Structure" tab, right-click on the table you'd like to add data to, and choose **Modify Table**, or click on the **Modify Table** button just under the tabs and above the table list.
234+
2. Click the **Add Field** button to add a new field and assign it a data type.
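
In SQL, adding a field can be done with an `ALTER TABLE ... ADD COLUMN` statement. The sketch below runs one through Python's built-in `sqlite3` module. The `notes` field is a hypothetical name chosen for this example, and DB Browser for SQLite may carry out the change differently behind the scenes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plots (plot_id INTEGER, plot_type TEXT)")
conn.execute("INSERT INTO plots VALUES (1, 'Control')")

# Add a new field and give it a data type; 'notes' is a hypothetical name.
conn.execute("ALTER TABLE plots ADD COLUMN notes TEXT")

# PRAGMA table_info lists one row per field; the name is the second item.
columns = [info[1] for info in conn.execute("PRAGMA table_info(plots)")]
rows = conn.execute("SELECT * FROM plots").fetchall()

print(columns)
print(rows)
```

Existing records get NULL in the new field until a value is filled in.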
235+
236+
### Data types {#datatypes}
237+
238+
SQLite has four data types, shown in the table below.
239+
240+
| Data type | Description |
241+
| ------------------ | :-------------------------------------------------------------------------------------------------------------------- |
242+
| TEXT | Text string |
243+
| INTEGER | Integer (positive or negative whole number) |
244+
| REAL | Approximate numerical value (floating point number) |
245+
| BLOB | General data with no specific type, stored in the database exactly as given (stands for _Binary Large OBject_) |
246+
247+
In addition to these four data types, SQLite has a NULL value for missing data. We will talk more about dealing with missing data in Episode 3.
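
If you would like to check how SQLite classifies a value, its built-in `typeof()` function reports the type. The sketch below calls it through Python's built-in `sqlite3` module; the literal values are arbitrary examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# typeof() returns the storage type of each value as a lowercase string.
row = conn.execute(
    "SELECT typeof('DM'), typeof(42), typeof(36.5), typeof(x'0102'), typeof(NULL)"
).fetchone()

print(row)
```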
248+
249+
:::::::::::::::::::::::::::::::::::::::: keypoints
250+
251+
- SQL allows us to select and group subsets of data, do math and other calculations, and combine data.
252+
- A relational database is made up of tables which are related to each other by shared keys.
253+
- Different database management systems (DBMS) use slightly different vocabulary, but they are all based on the same ideas.
254+
255+
::::::::::::::::::::::::::::::::::::::::::::::::::
256+
257+
