intro_to_sql/week2.qmd at main · fhdsl/intro_to_sql · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
---
title: "Week 2: `JOIN`s, More `WHERE`, Boolean Logic, `ORDER BY`"
---

## Connecting to our database

Let's connect to our database.

```{r}
#| context: setup
library(duckdb)
library(DBI)

con <- DBI::dbConnect(duckdb::duckdb(),
                      "data/GiBleed_5.3_1.1.duckdb")
```

## Table References

In single table queries, it is usually unambiguous to the query engine which column and which table you need to query.

However, when you involve multiple tables, it is important to know how to refer to a column in a specific table.

For example, the `procedure_occurrence` table has a `person_id` column as well. If we want to use this specific column in this table, we can use the `.` (dot) notation:

```
procedure_occurrence.person_id
```

If we wanted the `person_id` column in `person` we can use this:

```
person.person_id
```

This will become much more important as we get into `JOIN`ing tables.

### Check on Learning

Add table references to the `WHERE` part of the query:

```{sql}
#| connection: "con"
#| eval: false
SELECT *
  FROM procedure_occurrence
  WHERE person_id = 1
```

Let's get ready to work on queries involving multiple tables.

## Entity-relationship diagrams

Joining tables require understanding the relationship between tables in a database. This is often visualized via an **entity-relationship diagram:**

![](img/omop1.png)

Each rectangle represent a table, and within each table are the columns (fields). I am only showing a subset of the columns based on what we have explored so far in class. The connecting lines shows that there are shared values between tables in those columns, which helps one navigate between tables:

-   In the `person` table, the elements of the column `person_id` overlaps with the elements of `person_id` column in in the table `procedure_occurrence`.

-   In the `procedure_occurrence` table, the elements of the column `procedure_concept_id` overlaps with the elements of `concept_id` column in the table `concepts`.

We should consider to what degree the values overlap:

-   For each `person_id` in the `person` table, there may be duplicated `person_id`s in `procedure_occurrence` table, as a patient can have multiple procedures. This is a **one-to-many relationship**.

-   Multiple elements of `procedure_concept_id` in the `procedure_occurrence` table may correspond to a single element of `concept_id` in the "concept" table. This is a **many-to-one relationship**.

-   You can also have a **one-to-one relationship**.

The database we've been using has been rigorously modeled using a data model called [OMOP CDM (Common Data Model)](https://ohdsi.github.io/CommonDataModel/index.html). OMOP is short for Observational Medical Outcomes Partnership, and it is designed to be a database format that standardizes data from systems into a format that can be combined with other systems to compare health outcomes across organizations. The full OMOP entity relationship diagram can be [found here](https://ohdsi.github.io/CommonDataModel/cdm54erd.html).

Now, let's join some tables.

## `JOIN`

We use the `JOIN` clause when we want to combine information from two tables. Here we are going to combine information from two tables: `person` and `procedure_occurrence`.

To set the stage, let's show two tables, `x` and `y`. We want to join them by the keys, which are represented by colored boxes in both of the tables.

Note that table `x` has a key ("3") that isn't in table `y`, and that table `y` has a key ("4") that isn't in table `x`.

![](img/original-dfs.png)

We are going to explore `INNER JOIN` first. In an `INNER JOIN`, we pick out a column from each table in which its elements are going to be matched. In this case, we only retain rows that have elements that exist in both the `x` and `y` tables. We drop all rows that don't have matches in both tables.

![](img/inner-join.gif)There are other types of joins when we want to retain information from the `x` table or the `y` table, or both.

## `INNER JOIN` syntax

Here's an example where we are joining `person` with `procedure_occurrence`:

```{sql}
#| connection: "con"
SELECT person.person_id, procedure_occurrence.procedure_occurrence_id
    FROM person
    INNER JOIN procedure_occurrence
    ON person.person_id = procedure_occurrence.person_id
```

What's going on here? The magic happens with this clause, which we use to specify the two tables we need to join.

```
FROM person
  INNER JOIN procedure_occurrence
```

The last thing to note is the `ON` statement. These are the conditions by which we merge rows. We are taking one column in `person`, the `person_id`, and matching the rows up with those rows in `procedure_occurrence`'s own `person_id` column:

```
ON person.person_id = procedure_occurrence.person_id
```

## Aliases

As your queries get more complex, and as you involve more and more tables, you will need to use **aliases**. I think of them like "nicknames" - they can save you a lot of typing.

Here is the same query using aliases. We use `p` as an alias for `person` and `po` as an alias for `procedure_occurrence`. You can see it is a little more compact.

```{sql}
#| connection: "con"
SELECT p.person_id, po.procedure_occurrence_id
    FROM person AS p
    INNER JOIN procedure_occurrence AS po
    ON p.person_id = po.person_id
```

### Another example

Here, I use table aliasing in two different places: in my `COUNT`, and in my `WHERE`:

```{sql}
#| connection: "con"
SELECT COUNT(p.person_id)
  FROM person AS p
  WHERE p.year_of_birth < 2000;
```

Some people don't use `AS`, just putting the aliases next to the original name:

```{sql}
#| connection: "con"
SELECT COUNT(p.person_id)
  FROM person p
  WHERE p.year_of_birth < 2000;
```

We can also rename variables using `AS`:

```{sql}
#| connection: "con"
SELECT COUNT(person_id) AS person_count
  FROM person
  WHERE year_of_birth < 2000;
```

## `LEFT JOIN`

Jargon alert: The table to the **left** of the `JOIN` clause is called the **left table**, and the table to the **right** of the `JOIN` clause is known as the **right table**. This will become more important as we explore the different join types.

```
FROM procedure_occurrence INNER JOIN concept
      ^^Left Table                    ^^Right Table
```

What if we want to retain all of the rows in the `procedure_occurrence` table, even if there are no matches in the `concept` table? We can use a `LEFT JOIN` to do that.

![](img/left-join.gif)

If a row exists in the left table, but not the right table, it will be replicated in the joined table, but have rows with `NULL` columns from the right table.

Here is another example:

![](img/Slide4.jpeg)We can see the difference between a `INNER JOIN` and `LEFT JOIN` by counting the number of rows kept after joining:

```{sql}
#| connection: "con"
SELECT COUNT (*)
    FROM person as p
    INNER JOIN procedure_occurrence as po
    ON p.person_id = po.person_id
```

```{sql}
#| connection: "con"
SELECT COUNT (*)
    FROM person as p
    LEFT JOIN procedure_occurrence as po
    ON p.person_id = po.person_id
```

This suggests that there are some unique `person_id`s in `person` table not found in the `person_id` of `procedure_occurrence` table.

## Other kinds of `JOIN`s

-   The `RIGHT JOIN` is identical to `LEFT JOIN`, except that the rows preserved are from the *right* table.
-   The `FULL JOIN` retains all rows in both tables, regardless if there is a key match.
-   `ANTI JOIN` is helpful to find all of the keys that are in the *left* table, but not the *right* table

## Multiple `JOIN`s with Multiple Tables

![](img/omop1.png)

Suppose that we want a table with `person.person_id`, `procedure_occurrence.procedure_occurrence_id`, and `concept.concept_name`. Looks like we need a triple join!

The way I think of these multi-table joins is to decompose them into two joins:

1.  We first `INNER JOIN` `person` and `procedure_occurrence`, to produce an output table
2.  We take this output table and `INNER JOIN` it with `concept`.

Give a try yourself:

```{sql connection="con"}
#| eval: false
SELECT person.person_id, procedure_occurrence.procedure_occurrence_id
  FROM
  INNER JOIN
  ON

```

Then, add the third join:

```{sql connection="con"}
#| eval: false
SELECT person.person_id, procedure_occurrence.procedure_occurrence_id, concept.concept_name
  FROM
  INNER JOIN
  ON
  INNER JOIN
  ON

```

Some tips: Notice that both of these `JOIN`s have separate `ON` statements. For the first join, we could have:

```
INNER JOIN procedure_occurrence AS po
  ON p.person_id = po.person_id
```

For the second `JOIN`, we could have:

```
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
```

And that gives us the final table, which takes variables from all three tables.

One thing to keep in mind is that `JOIN`s are not necessarily commutative; that is, the order of joins can matter. This is because we may drop or preserve rows depending on the `JOIN`.

For combining `INNER JOIN`s, we are looking for the subset of keys that exist in each table, so join order doesn't matter. But for combining `LEFT JOIN`s and `RIGHT JOINS`, order *can* matter.

It's really important to check intermediate output and make sure that you are retaining the rows that you need in the final output. For example, I'd try the first join first and see that it contains the rows that I need before adding the second join.

Here is the solution:

```{sql connection="con"}
SELECT p.person_id, po.procedure_occurrence_id, c.concept_name
  FROM person AS p
  INNER JOIN procedure_occurrence AS po
  ON p.person_id = po.person_id
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
```

## Using `JOIN` with `WHERE`

Where we really start to cook with gas is when we combine `JOIN` with `WHERE`. Let's add an additional `WHERE` where we only want those rows that have the `concept_name` of 'Subcutaneous immunotherapy\`:

```{sql}
#| connection: "con"
SELECT p.person_id, po.procedure_occurrence_id, c.concept_name
  FROM person AS p
  INNER JOIN procedure_occurrence AS po
  ON p.person_id = po.person_id
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
  WHERE c.concept_name = 'Subcutaneous immunotherapy';

```

Or keeping rows where the year of birth is before 1980:

```{sql}
#| connection: "con"
SELECT p.person_id, p.year_of_birth, po.procedure_occurrence_id, c.concept_name
  FROM person AS p
  INNER JOIN procedure_occurrence AS po
  ON p.person_id = po.person_id
  INNER JOIN concept AS c
  ON po.procedure_concept_id = c.concept_id
    WHERE p.year_of_birth < 1980;
```

## `WHERE` vs `ON`

You will see variations of SQL statements that eliminate `JOIN` and `ON` entirely, putting everything in `WHERE`:

```{sql}
#| connection: "con"
SELECT po.person_id, c.concept_name
    FROM procedure_occurrence as po, concept as c
    WHERE c.concept_name = 'Subcutaneous immunotherapy'
    AND po.procedure_concept_id = c.concept_id
    LIMIT 10;
```

I'm not the biggest fan of this, because it is often not clear what is a filtering clause and what is a joining clause, so I prefer to use `JOIN`/`ON` with a `WHERE`.

## Revisiting `WHERE`: `AND` versus `OR`

Revisiting `WHERE`, we can combine conditions with `AND` or `OR`.

`AND` is always going to be more restrictive than `OR`, because our rows must meet two conditions.

```{sql}
#| connection: "con"
SELECT COUNT(*)
  FROM person
  WHERE year_of_birth < 1980
  AND gender_source_value = 'M'
```

On the other hand `OR` is more permissive than `AND`, because our rows must meet only one of the conditions.

```{sql}
#| connection: "con"
SELECT COUNT(*)
  FROM person
  WHERE year_of_birth < 1980
  OR gender_source_value = 'M'
```

There is also `NOT`, where one condition must be true, and the other must be false.

```{sql}
#| connection: "con"
SELECT COUNT(*)
  FROM person
  WHERE year_of_birth < 1980
  AND NOT gender_source_value = 'M'
```

## `ORDER BY`

`ORDER BY` lets us sort tables by one or more columns:

```{sql}
#| connection: "con"
SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date
    FROM person as p
    INNER JOIN procedure_occurrence as po
    ON p.person_id = po.person_id
    ORDER BY p.person_id;
```

Once we sorted by `person_id`, we see that for every unique `person_id`, there can be multiple procedures! This suggests that there is a **one-to-many relationship** between `person` and `procedure_occurrence` tables.

## Try it Out

We can `ORDER BY` multiple columns at once. Try ordering by `p.patient_id` and `po.procedure_date`:

```{sql}
#| connection: "con"
#| eval: false
SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date
    FROM person as p
    INNER JOIN procedure_occurrence as po
    ON p.person_id = po.person_id
    ORDER BY ----, ----
```

## Constraints and rules for Databases

So far, we've only queried data, but not added data to databases.

As we've stated before, DuckDB is an Analytical database, not a Transactional one. That means it prioritizes reading from data tables rather than inserting into them. Transactional databases, on the other hand, can handle multiple inserts from multiple users at once. They are made for *concurrent* transactions.

We are not going to look at how to add to a database in this course, but we are going to examine what the *constraints* can be placed on a database, because this gives rules on what is allowed in our database to be queried.

When one sets up a database, we also set up the constraints via a *Data Definition Language* for our tables:

``` sql
CREATE TABLE @cdmDatabaseSchema.PERSON (
            person_id integer NOT NULL,
            gender_concept_id integer NOT NULL,
            year_of_birth integer NOT NULL,
            month_of_birth integer NULL,
            day_of_birth integer NULL,
            birth_datetime TIMESTAMP NULL,
            race_concept_id integer NOT NULL,
            ethnicity_concept_id integer NOT NULL,
            location_id integer NULL,
            provider_id integer NULL,
            care_site_id integer NULL,
            person_source_value varchar(50) NULL,
            gender_source_value varchar(50) NULL,
            gender_source_concept_id integer NULL,
            race_source_value varchar(50) NULL,
            race_source_concept_id integer NULL,
            ethnicity_source_value varchar(50) NULL,
            ethnicity_source_concept_id integer NULL );
```

We've encountered one constraint: database fields (columns) need to be *typed*. For example, id keys are usually `INTEGER`. Names are often `VARCHAR`.

Here are some other constraints that can be applied to a field (column):

-   `NOT NULL` - no values can have a `NULL` value.
-   `UNIQUE` - all values must be unique.
-   `PRIMARY KEY` - `NOT NULL` and `UNIQUE`.
-   `FOREIGN KEY` - value must exist as a primary key in another table's field. The referenced table's field must be specified.
-   `CHECK` - check the data type and conditions. One example would be our data shouldn't be before 1900.
-   `DEFAULT` - default values are given if not provided.

The most important constraints to know about are `PRIMARY KEY` and `FOREIGN KEY`. A `PRIMARY KEY` is required for any table, and cannot be `NULL` and must be unique. This gives an unique id for each entry of the table.

When we create tables in our database, we need to specify which column is a `PRIMARY KEY`:

``` sql
CREATE TABLE person (
  person_id INTEGER PRIMARY KEY
)
```

`FOREIGN KEY` involves two or more tables. If a column is declared a `FOREIGN KEY`, then that key value must *exist* in a `REFERENCES` table as a primary key. Here, when we create `procedure_occurrence`, `person_id` column `REFERENCES` the table `person`'s `person_id` primay key column, and `procedure_concept_id` column `REFERENCES` the table `concept`'s `concept_id` primary key column.

``` sql
CREATE TABLE procedure_occurrence {
  procedure_occurrence_id PRIMARY KEY,
  person_id INTEGER REFERENCES person(person_id)
  procedure_concept_id INTEGER REFERENCES concept(concept_id)
}
```

Thus, we can use constraints to make sure that our database retains its integrity when we add rows to it.

You can see an example of constraints for our database here: <https://github.com/OHDSI/CommonDataModel/blob/v5.4.0/inst/ddl/5.4/postgresql/OMOPCDM_postgresql_5.4_constraints.sql>.

## Always close the connection

When we're done, it's best to close the connection with `dbDisconnect()`.

```{r}
dbDisconnect(con)
```

## References

-   [Tidyexplain](https://github.com/gadenbuie/tidyexplain) - All `JOIN` animations come from here.
-   [Understanding Joins](https://r4ds.had.co.nz/relational-data.html#understanding-joins) - another nice visual explanation from R for Data Science.