Commit ef901fa

update sql part 2
1 parent dda530f commit ef901fa

4 files changed

+346
-23
lines changed

tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_1.Rmd

Lines changed: 1 addition & 1 deletion
@@ -356,4 +356,4 @@ bigrquery::dbDisconnect(bigquery_conn)
356356
+ An RStudio [guide](https://db.rstudio.com/getting-started/connect-to-database/) on how to connect to an existing database using the `odbc` and `DBI` packages.
357357
+ An RStudio [guide](https://db.rstudio.com/databases/big-query/) on how to connect to Google BigQuery using `odbc` or `bigrquery`.
358358
+ The official [vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) on using `dbplyr` to perform SQL queries via the tidyverse API.
359-
+ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R .
359+
+ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R.

tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_1.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
11
How to integrate SQL and R - Part 1
22
================
33
Erika Duan
4-
2022-01-26
4+
2022-01-30
55

66
- [Introduction to the relational
77
model](#introduction-to-the-relational-model)
@@ -467,4 +467,4 @@ bigrquery::dbDisconnect(bigquery_conn)
467467
and [part
468468
3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/)
469469
containing advanced tips for using `dbplyr` to query SQL databases
470-
in R .
470+
in R.

tutorials/p-sql_to_r_workflows/p-sql_to_r_workflows_part_2.Rmd

Lines changed: 144 additions & 13 deletions
@@ -10,13 +10,14 @@ output:
1010

1111
```{r setup, include=FALSE}
1212
# Set up global environment ----------------------------------------------------
13-
knitr::opts_chunk$set(echo=TRUE, results='hide', fig.show='hold', fig.align='center')
13+
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, results='hide', fig.show='hold', fig.align='center')
1414
```
1515

16-
```{r load libraries, message=FALSE, warning=FALSE}
16+
```{r load libraries, message=FALSE}
1717
# Load required packages -------------------------------------------------------
1818
if (!require("pacman")) install.packages("pacman")
1919
pacman::p_load(here,
20+
dbplyr,
2021
tidyverse,
2122
odbc,
2223
DBI,
@@ -48,8 +49,8 @@ knitr::include_graphics("../../figures/p-sql_to_r_workflows-execution_order.svg"
4849

4950
In SQL, the sequence in which SQL code is written does not correspond to the sequence in which it is executed, which leads to common errors like the example below.
5051

51-
```{r, results='hold'}
52-
# Incorrect SQL code -----------------------------------------------------------
52+
```{r, sql select queries, results='hold'}
53+
# Incorrect SQL query ----------------------------------------------------------
5354
# odbc::dbSendQuery(
5455
# tsql_conn,
5556
# "
@@ -62,11 +63,11 @@ In SQL, the sequence in which SQL code is written does not correspond to the seq
6263
# WHERE hobbit_id <> 2
6364
#
6465
# -- this query will generate an error as WHERE is executed before SELECT and
65-
# -- the renaming of student_id to hobbit_id happens after WHERE
66+
# -- the renaming of student_id to hobbit_id only happens via SELECT
6667
# "
6768
# )
6869
69-
# Correct SQL code -------------------------------------------------------------
70+
# Correct SQL query ------------------------------------------------------------
7071
odbc::dbSendQuery(
7172
tsql_conn,
7273
"
@@ -78,26 +79,156 @@ odbc::dbSendQuery(
7879
FROM education.student
7980
WHERE student_id <> 2
8081
81-
-- student_id is first filtered via WHERE and lastly renamed to hobbit_id
82+
-- student_id is first filtered via WHERE and then renamed as hobbit_id via SELECT
8283
"
8384
) %>%
8485
odbc::dbFetch() %>%
8586
knitr::kable()
8687
```
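The same execution-order rule can be reproduced locally with `dplyr`. The following is a minimal sketch, assuming a hypothetical in-memory `student` data frame that stands in for `education.student` (the column values are illustrative and not taken from the real table).

```r
library(dplyr)

# Hypothetical stand-in for education.student; values are illustrative only
student <- data.frame(
  student_id = c(1, 2, 3),
  first_name = c("A", "B", "C")
)

# Filter on the original column name first (WHERE), then rename it (SELECT)
hobbits <- student %>%
  filter(student_id != 2) %>%
  rename(hobbit_id = student_id)

# Filtering on hobbit_id before it exists mirrors the failing SQL query
failed <- try(
  student %>%
    filter(hobbit_id != 2) %>%
    rename(hobbit_id = student_id),
  silent = TRUE
)

inherits(failed, "try-error")
```

Because `dplyr` pipelines are written in the order they run, the failing step is easier to spot here than in the SQL query, where `SELECT` is written first but executed after `WHERE`.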
8788

8889

89-
## Writing `GROUP BY` SQL queries
90+
## Writing SQL `JOIN` queries
9091

91-
## Writing SQL subqueries
92+
The `dplyr` syntax for joining tables is very similar to its corresponding SQL syntax. The concepts of left joins, right joins, inner joins and full joins are shared across both languages. In SQL, `JOIN` is executed directly after `FROM`, and it is best practice to alias (rename via `AS`) table names so that it is clear which table each joined column comes from.
93+
94+
**Note:** Always ensure that you are joining on at least one column containing unique IDs to prevent unexpected many-to-many joined results.
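One way to check key uniqueness before joining is to count how often each key value appears. The sketch below assumes two hypothetical local data frames, `course` and `platform`, standing in for the `education` tables; their contents are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for education.platform and education.course
platform <- data.frame(
  platform_id = c(1, 2, 3),
  platform_name = c("P1", "P2", "P3")
)

course <- data.frame(
  course_id = c(1, 2, 3, 4),
  platform_id = c(1, 1, 2, 3)
)

# A key is unique when no value appears more than once
duplicated_keys <- platform %>%
  count(platform_id) %>%
  filter(n > 1)

# With a unique key on the platform side, the join cannot multiply course rows
joined <- inner_join(course, platform, by = "platform_id")

nrow(duplicated_keys)
nrow(joined) == nrow(course)
```

If `duplicated_keys` contains any rows, each matching course record would be repeated once per duplicate platform record after the join.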
95+
96+
```{r sql join query, results='hold'}
97+
# Perform inner join using SQL query -------------------------------------------
98+
odbc::dbSendQuery(
99+
tsql_conn,
100+
"
101+
SELECT
102+
c.course_name,
103+
c.course_desc,
104+
p.platform_name,
105+
p.company_name
106+
107+
FROM education.course AS c
108+
INNER JOIN education.platform AS p
109+
ON c.platform_id = p.platform_id
110+
111+
WHERE is_active = 1 OR is_active IS NULL
112+
113+
-- select course name and description and their corresponding platform and
114+
-- company name for platforms that are active or NULL for is_active
115+
"
116+
) %>%
117+
odbc::dbFetch() %>%
118+
knitr::kable()
119+
```
120+
121+
The equivalent R `dplyr` syntax follows the execution order, rather than written order, of the SQL join query.
122+
123+
**Note:** When using `dbplyr`, the education schema needs to be explicitly passed using the function `in_schema("schema", "table")` inside `tbl()`.
124+
125+
```{r dbplyr join query, results='hold'}
126+
# Perform inner join using R syntax --------------------------------------------
127+
tbl(tsql_conn, in_schema("education", "course")) %>%
128+
inner_join(tbl(tsql_conn, in_schema("education", "platform")),
129+
by = c("platform_id" = "platform_id")) %>%
130+
filter(is_active == 1 | is.na(is_active)) %>%
131+
select(course_name,
132+
course_desc,
133+
platform_name,
134+
company_name) %>%
135+
collect()
136+
```
137+
138+
139+
## Writing SQL `GROUP BY` queries
140+
141+
In SQL, grouping by one or more columns collapses the individual records that share the same values in those columns into a single group. Because a grouped query returns exactly one record per group, `SELECT` can only refer to the group by column(s) and to aggregations of other columns.
142+
143+
```{r incorrect sql group by query}
144+
# Incorrect SQL query ----------------------------------------------------------
145+
# odbc::dbSendQuery(
146+
# tsql_conn,
147+
# "
148+
# SELECT
149+
# p.platform_name,
150+
#     COUNT(course_id) AS total_courses,
151+
# c.course_id
152+
#
153+
# FROM education.course AS c
154+
# INNER JOIN education.platform AS p
155+
# ON c.platform_id = p.platform_id
156+
#
157+
# GROUP BY p.platform_id, p.platform_name
158+
#
159+
# -- this query will generate an error as course_id is no longer an atomic
160+
# -- record once grouped by platform_id and platform_name
161+
# "
162+
# )
163+
```
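The same restriction is visible in `dplyr`: once records are summarised per group, only the grouping column(s) and the aggregations remain. The sketch below uses a hypothetical local `course` data frame with made-up values. In the SQL query above, the error could be avoided either by dropping `c.course_id` from `SELECT` or by adding it to `GROUP BY`, which changes the grain of the result.

```r
library(dplyr)

# Hypothetical stand-in for education.course; values are illustrative only
course <- data.frame(
  course_id = c(1, 2, 3, 4),
  platform_id = c(1, 1, 2, 3)
)

# One row per platform: course_id is no longer available as a column
courses_per_platform <- course %>%
  group_by(platform_id) %>%
  summarise(total_courses = n(), .groups = "drop")

names(courses_per_platform)
```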
92164

165+
In SQL, `WHERE` and `HAVING` are separate filtering methods. `WHERE` is executed first, across individual records, before the `GROUP BY` statement. `HAVING` is executed after `GROUP BY` and filters whole groups, which is why its condition is usually written on an aggregation.
166+
167+
As `SELECT` is executed last, this also means that the `SELECT` statement can only refer to the column(s) being grouped and aggregations of other columns.
168+
169+
```{r sql group by query, results='hold'}
170+
# Perform group by and aggregate SQL query -------------------------------------
171+
odbc::dbSendQuery(
172+
tsql_conn,
173+
"
174+
SELECT
175+
p.platform_id,
176+
p.platform_name,
177+
COUNT(course_id) AS total_courses,
178+
AVG(course_length) AS avg_course_length,
179+
MIN(course_length) AS min_course_length,
180+
MAX(course_length) AS max_course_length
181+
182+
FROM education.course AS c
183+
INNER JOIN education.platform AS p
184+
ON c.platform_id = p.platform_id
185+
186+
WHERE course_length IS NOT NULL
187+
188+
GROUP BY p.platform_id, p.platform_name
189+
190+
HAVING COUNT(course_id) > 1
191+
192+
-- join course and platform tables, filter out courses without a course length
193+
-- and then group records by platform_id and platform_name
194+
-- filter to exclude platforms which only have one course
195+
-- finally select platform id and name, total course count, average course length, min
196+
-- course length and max course length
197+
"
198+
) %>%
199+
odbc::dbFetch() %>%
200+
knitr::kable()
201+
```
202+
203+
The equivalent R `dplyr` syntax uses `filter()` before and after `group_by()`, and aggregations are performed inside `summarise()`. R also allows `mutate()` to be used after `group_by()` to create a new column from a group-based transformation, with the result returned for every individual record.
204+
205+
```{r dbplyr group by query, results='hold'}
206+
# Perform group by and aggregate using R syntax --------------------------------
207+
tbl(tsql_conn, in_schema("education", "course")) %>%
208+
inner_join(tbl(tsql_conn, in_schema("education", "platform")),
209+
by = c("platform_id" = "platform_id")) %>%
210+
filter(!is.na(course_length)) %>%
211+
group_by(platform_id, platform_name) %>%
212+
summarise(total_courses = n_distinct(course_id),
213+
avg_course_length = mean(course_length),
214+
min_course_length = min(course_length),
215+
max_course_length = max(course_length)) %>%
216+
filter(total_courses > 1) %>%
217+
ungroup() %>%
218+
collect()
219+
```
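The grouped `mutate()` behaviour mentioned above can be sketched on a local data frame. The `course` data frame and its values below are hypothetical. Unlike `summarise()`, the number of rows is unchanged and each record carries its group-level value; on a database backend, `dbplyr` expresses this kind of grouped `mutate()` with a SQL window function.

```r
library(dplyr)

# Hypothetical stand-in for education.course; values are illustrative only
course <- data.frame(
  course_id = c(1, 2, 3, 4),
  platform_id = c(1, 1, 2, 2),
  course_length = c(10, 20, 30, 50)
)

# Attach the platform-level average to every individual course record
course_with_group_mean <- course %>%
  group_by(platform_id) %>%
  mutate(avg_platform_length = mean(course_length)) %>%
  ungroup()

course_with_group_mean
```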
220+
221+
222+
## Writing SQL `WINDOW` functions
223+
224+
225+
## Writing SQL subqueries
93226

94227

95228
# Production friendly workflows
96229

97230

98231
# Other resources
232+
+ A great series of blog posts by Vebash Naidoo, with [part 1](https://sciencificity-blog.netlify.app/posts/2020-12-12-using-the-tidyverse-with-databases/), [part 2](https://sciencificity-blog.netlify.app/posts/2020-12-20-using-the-tidyverse-with-dbs-partii/) and [part 3](https://sciencificity-blog.netlify.app/posts/2020-12-31-using-tidyverse-with-dbs-partiii/) containing advanced tips for using `dbplyr` to query SQL databases in R.
99233
+ Emily Riederer's excellent [blog post](https://emilyriederer.netlify.app/post/sql-r-flow/) containing ideas for creating flexible SQL <> R project workflows.
100-
+ Julia Evans's entertaining [programming zine](https://wizardzines.com/zines/sql/) explaining all things SQL. (Paywalled resource)
101-
102-
103-
234+
+ Julia Evans's entertaining [programming zine](https://wizardzines.com/zines/sql/) explaining all things SQL. (Paywalled resource)
